Fraud detection in Credit Card Transactions¶

fraud_image_colab_ready.png

Introduction¶

In recent years, digital payment methods have become the backbone of the global economy, revolutionizing how individuals, businesses, and governments transfer value. Credit and debit cards now facilitate billions of transactions every single day, powering everything from online shopping and contactless payments at physical stores to subscription-based platforms like Netflix or Spotify.

Their integration into mobile wallets, wearable technology, and global e-commerce ecosystems has made them not just convenient - but essential for modern life, and as digital infrastructures expand, card-based payments continue to replace cash at an accelerating pace, becoming the default method of exchange for billions of people.

However, this rapid expansion does not come without risks. According to data from the Nilson Report (Issue 1276, 2023), global card fraud losses surpassed $34 billion in 2022, and are projected to exceed $49 billion by 2030. Other industry estimates suggest that a fraudulent transaction takes place approximately every 16 seconds somewhere around the world. Credit card fraud presents disturbing figures, which highlights the scale and persistence of the threat.

But how is it done?

Usually, cybercriminals and fraudsters are leveraging increasingly sophisticated methods, such as:

  • Card-not-present (CNP) fraud, which are common in online transactions

  • Phishing and social engineering to steal credentials

  • Malware and data breaches that expose millions of card numbers

  • Synthetic identity fraud, where fake identities are constructed to open credit accounts

These techniques have proven to be highly effective, and they allowed to bypass traditional security systems, exploit digital payment infrastructures and cause finanical harm on a massive scale. It is also important to mention that the attacks do not only result in direct monetary losses for consumers and finanical institutions, but they also primarily undermine the trust in digital commerece.

image.png

So, we know we are dealing with a genuinely concerning problem, but now we must ask the question - what ideas have we come up with to solve this issue?

To begin with, many banks and financial institutions have deployed transaction monitoring systems. They have also share their blacklists of compromised cards, fraud "hot lists", and data breach information with other banks and merchants. From a legal and regulatory standpoint, several government legislations have been enacted. Educational campaigns have also been introduced to encourage people not to share card details, be cautious of phishing emails, and regularly review their bank statements.

But as data scientists, our focus is on analyzing the technical capabilities of credit card fraud detection systems - particularly from a performance point of view.

For decades now, credit card fraud detection methodologies have relied heavily on rule-based systems - essentially hard-coded, static if-else logic defined by human experts. These systems followed the industry's best practices, typically flagging transactions that:

  • Exceeded a predefined monetary threshold

  • Originated from high-risk countries

  • Occurred at unusual times (e.g., late at night)

  • Were initiated from previously unseen devices

While these rules were simple to implement and easy to interpret, they consistently underperformed in complex, fast-evolving fraud environments.

How badly have they underperformed?

According to a 2024 study comparing rule-based and machine learning approaches to fraud detection (Sule et al., 2024), a survey of 150 financial institutions revealed that traditional rule-based systems catch only about 65-70% of fraudulent transactions. This demonstrates that such systems are limited to detecting known or obvious patterns and lack the flexibility to adapt to constantly evolving attack strategies.

These systems suffer from three main drawbacks:

  1. Lack of adaptability - The rules are reactive, not predictive. By the time a new rule is created, fraudsters have often changed tactics. Moreover, as the number of transactions grows into the millions per hour, maintaining, tuning, and updating these rules becomes increasingly impractical

  2. High false-positive rates - Many legitimate users are falsely flagged and blocked, leading to frustration and loss of revenue, and eventually, loss of reputation

  3. Inability to detect complex patterns - Fraud patterns often involve subtle feature interactions that rules cannot capture

Recognizing these limitations, we naturally turn to a modern solution that has shown great promise - Machine Learning and Deep Learning

Machine learning, and in recent years, deep learning - offers a fundamentally different paradigm. Instead of relying on hardcoded rules, these models learn patterns from data - whether linear trends, non-linear interactions, or hidden correlations.

When applied to credit card fraud detection, this approach provides several significant advantages over traditional rule-based systems:

  1. Pattern Recognition at scale:

    Machine learning algorithms excel at identifying subtle and complex behaviors across vast datasets. This allows them to detect not only known fraud signals, but also previously unseen patterns that rule-based systems would completely miss

  2. Real-time detection

    With the help of fast, lightweight models and advanced streaming techniques, machine learning - based systems can flag suspicious activity in near real-time. This enables finanical institutions to intervene before a fraudulent transaction is completed, thereby minimizing losses

  3. Adaptive Learning

    Unlike static rule engines, machine learning models can be continuously retrained on fresh data, allowing them to evolve alongside fraudster tactics. This makes them far more resilient against emerging threats such as synthetic identity fraud or bot-driven attacks

  4. Reduced False Positives

    One of the major limitations of rule-based systems is their high false-positive rate. By leveraging features like transaction time, location, amount, merchant category, device fingerprint and user behavior history, machine learning models can better distinguish between normal and suspicious activity - resulting in fewer legitimate transactions being blocked

In essence, machine learning and deep learning give fraud detection systems the ability to "think statistically", to learn from history, adapt to new trends, and respond to threats faster than humans can design new rules.

So... What's the catch?

While machine learning and deep learning offer powerful capabilities, applying them to credit card fraud detection is far from straightforward. These systems require large volumes of high-quality data, careful model design, and robust evaluation to be effective - and even then, several challenges persist. For instance:

  • Class Imbalance: in real-world scenarios, fradulent transactions are relatively rare compared to the total volume of transactions. This imbalance can cause models to favor the majority class (non-fraud) and overlook the minority class (fraud), leading to poor recall - that is, actual fraud cases being missed

  • Data privacy and availability: Accessing real-world transaction data is difficult due to strict confidentiality and compliance standards (e.g., GDPR, PCI DSS). This limits open research and model generalizability

  • False Positives vs. Customer Experience: A highly sensitive model may flag too many legitimate transactions as fraud, causing customer frustration, support overload and reputational damage. Striking the right balance between precision and recall is a constant challenge

To explore these challenges in a controlled environment, we turn to a publicly available dataset from Kaggle, which simulates real-world credit card transaction patterns. This dataset contains thousands of labeled transactions - some genuine, others fraudulent, and serves as the foundation for our fraud detection experiments.

Although it does not capture the full complexity of enterprise-scale financial systems, it enables us to simulate class imbalance, apply and compare various machine learning models, and evaluate the trade-offs between detection accuracy and false positive rates. By succeeding in this endeavor, we can gradually refine these models, making them increasingly robust and accurate over time.

Motivation¶

Credit card fraud is more than a technical anomaly - it's a global problem that directly affects the financial security and emotional well-being of millions of people. Behind every fraudulent transaction is a victim, someone whose trust was violated, whose savings may have been compromised, and who now faces a difficult and often bureaucratic process to reclaim what was lost.

The motivation for this project stems from our desire to contribute, even in a small way, to a safer digital ecosystem. As data science students and future data professionals, we believe we have a responsibility to harness machine learning not only for innovation, but also for protection. The growing accessibility of fraud tools on the dark web, the rise of AI-generated phishing attacks, and the sheer scale of financial losses each year all point to a troubling trend - one that requires urgent and ongoing attention.

By exploring how data-driven techniques can help detect and mitigate fraudulent behavior, we hope to highlight the positive and ethical role data science can play in safeguarding individuals and institutions alike.

smallest_combating_credit_card_fraud.png

Project Overview¶

This project aims to build an intelligent system capable of detecting fraudulent credit card transactions using machine learning techniques. We will approach this problem by leveraging the Credit Card Transactions dataset provided on Kaggle, which contains over 1.85 million anonymized records of credit card transactions. Each record includes a fraud label indicating whether the transaction is fradulent or not.

We will:

  • Analyze the distribution and characteristics of fraudulent vs. non-fraudulent transactions

  • Handle class imbalance, which is a core challenge in fraud detection

  • Train and compare multiple machine learning models, including neural networks for classification

  • Optimize the balance between recall (catching frauds) and precision (avoiding false alarms)

  • Evaluate models using real-world metrics such as recall, precision, F1-score, AUC-ROC

Later, we will summarize our findings, highlighting which models performed best under what conditions and discuss the trade-offs encountered during model tuning.


Information about the dataset:

You can access the dataset here:

👉 Kaggle - Credit Card Transactions

Each record in the dataset represents a single transaction and includes the following features:

  1. trans_date_trans_time - Exact date and time of the transaction

  2. cc_num - Credit card number used

  3. merchant - Merchant or vendor where the purchase took place

  4. category - Transaction category (e.g., groceries, entertainment, etc.)

  5. amt - Monetary amount of the transaction (USD)

  6. first - Cardholder's first name

  7. last - Cardholder's last name

  8. gender - Gender of the cardholder

  9. street - Street address of the cardholder

  10. city - City of the cardholder

  11. state - U.S. state (2-letter abbreviation)

  12. zip - ZIP code of the cardholder's address

  13. lat - Latitude coordinate of the cardholder's location

  14. long - Longitude coordinate of the cardholder's location

  15. city_pop - Population of the cardholder's city

  16. job - Cardholder's occupation

  17. dob - Date of birth of the cardholder

  18. trans_num - Unique transaction identifier

  19. unix_time - Transaction timestamp in UNIX format

  20. merch_lat - Latitude of the merchant's location

  21. merch_long - Longtitude of the merchant's location

  22. is_fraud - Target label (0 = genuine, 1 = fraudulent)

  23. unnamed: 0 - index column created during export (to be dropped)

Exploratory Data Analysis (EDA)¶

In this stage we will:

  • Analyze the dataset to understand its structure and feature distributions

  • Identify potential anomalies, outliers, or data quality issues

  • Use visualizations to uncover trends and relationships between features

  • Establish initial insights that will inform the preprocessing and modeling stages

Mount Google Drive¶

In [3]:
from google.colab import drive
drive.mount("/content/drive")
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

Libraries¶

In [4]:
# Data Manipulation and Numerical Analysis
import pandas as pd
import numpy as np


#  Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px


# Scikit-learn
from sklearn.model_selection import StratifiedKFold, GridSearchCV, cross_validate
from sklearn.preprocessing import (
    StandardScaler,
    MinMaxScaler,
    OneHotEncoder,
    FunctionTransformer
)
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (
    RandomForestClassifier,
    HistGradientBoostingClassifier
)
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    roc_auc_score,
    average_precision_score
)
from sklearn.utils.class_weight import compute_class_weight
from sklearn.base import BaseEstimator, TransformerMixin


# Imbalanced Data Handling
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE

from statsmodels.stats.proportion import proportions_ztest
from sklearn.decomposition import PCA

import re

# Geographical
import folium
from folium.plugins import MarkerCluster

# Clustering
from sklearn.manifold import TSNE
import time
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Temporal
import matplotlib.ticker as mtick

# Traning Models
import torch
import random
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay


from sklearn.metrics import make_scorer
from IPython.display import clear_output
from xgboost import XGBClassifier


import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader
from sklearn.base import ClassifierMixin
from sklearn.model_selection import train_test_split

Importing the dataset¶

In [5]:
df_train = pd.read_csv('/content/drive/MyDrive/kaggle/סדנה במדעי הנתונים - עומר ומקס/fraudTrain.csv')
df_test = pd.read_csv('/content/drive/MyDrive/kaggle/סדנה במדעי הנתונים - עומר ומקס/fraudTest.csv')

Structure of the dataset¶

In [6]:
df_train.head()
Out[6]:
Unnamed: 0 trans_date_trans_time cc_num merchant category amt first last gender street ... lat long city_pop job dob trans_num unix_time merch_lat merch_long is_fraud
0 0 2019-01-01 00:00:18 2703186189652095 fraud_Rippin, Kub and Mann misc_net 4.97 Jennifer Banks F 561 Perry Cove ... 36.0788 -81.1781 3495 Psychologist, counselling 1988-03-09 0b242abb623afc578575680df30655b9 1325376018 36.011293 -82.048315 0
1 1 2019-01-01 00:00:44 630423337322 fraud_Heller, Gutmann and Zieme grocery_pos 107.23 Stephanie Gill F 43039 Riley Greens Suite 393 ... 48.8878 -118.2105 149 Special educational needs teacher 1978-06-21 1f76529f8574734946361c461b024d99 1325376044 49.159047 -118.186462 0
2 2 2019-01-01 00:00:51 38859492057661 fraud_Lind-Buckridge entertainment 220.11 Edward Sanchez M 594 White Dale Suite 530 ... 42.1808 -112.2620 4154 Nature conservation officer 1962-01-19 a1a22d70485983eac12b5b88dad1cf95 1325376051 43.150704 -112.154481 0
3 3 2019-01-01 00:01:16 3534093764340240 fraud_Kutch, Hermiston and Farrell gas_transport 45.00 Jeremy White M 9443 Cynthia Court Apt. 038 ... 46.2306 -112.1138 1939 Patent attorney 1967-01-12 6b849c168bdad6f867558c3793159a81 1325376076 47.034331 -112.561071 0
4 4 2019-01-01 00:03:06 375534208663984 fraud_Keeling-Crist misc_pos 41.96 Tyler Garcia M 408 Bradley Rest ... 38.4207 -79.4629 99 Dance movement psychotherapist 1986-03-28 a41d7549acf90789359a9aa5346dcb46 1325376186 38.674999 -78.632459 0

5 rows × 23 columns

In [7]:
df_test.head()
Out[7]:
Unnamed: 0 trans_date_trans_time cc_num merchant category amt first last gender street ... lat long city_pop job dob trans_num unix_time merch_lat merch_long is_fraud
0 0 2020-06-21 12:14:25 2291163933867244 fraud_Kirlin and Sons personal_care 2.86 Jeff Elliott M 351 Darlene Green ... 33.9659 -80.9355 333497 Mechanical engineer 1968-03-19 2da90c7d74bd46a0caf3777415b3ebd3 1371816865 33.986391 -81.200714 0
1 1 2020-06-21 12:14:33 3573030041201292 fraud_Sporer-Keebler personal_care 29.84 Joanne Williams F 3638 Marsh Union ... 40.3207 -110.4360 302 Sales professional, IT 1990-01-17 324cc204407e99f51b0d6ca0055005e7 1371816873 39.450498 -109.960431 0
2 2 2020-06-21 12:14:53 3598215285024754 fraud_Swaniawski, Nitzsche and Welch health_fitness 41.28 Ashley Lopez F 9333 Valentine Point ... 40.6729 -73.5365 34496 Librarian, public 1970-10-21 c81755dbbbea9d5c77f094348a7579be 1371816893 40.495810 -74.196111 0
3 3 2020-06-21 12:15:15 3591919803438423 fraud_Haley Group misc_pos 60.05 Brian Williams M 32941 Krystal Mill Apt. 552 ... 28.5697 -80.8191 54767 Set designer 1987-07-25 2159175b9efe66dc301f149d3d5abf8c 1371816915 28.812398 -80.883061 0
4 4 2020-06-21 12:15:17 3526826139003047 fraud_Johnston-Casper travel 3.19 Nathan Massey M 5783 Evan Roads Apt. 465 ... 44.2529 -85.0170 1126 Furniture designer 1955-07-06 57ff021bd3f328f8738bb535c302a31b 1371816917 44.959148 -85.884734 0

5 rows × 23 columns

In [8]:
df_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1296675 entries, 0 to 1296674
Data columns (total 23 columns):
 #   Column                 Non-Null Count    Dtype  
---  ------                 --------------    -----  
 0   Unnamed: 0             1296675 non-null  int64  
 1   trans_date_trans_time  1296675 non-null  object 
 2   cc_num                 1296675 non-null  int64  
 3   merchant               1296675 non-null  object 
 4   category               1296675 non-null  object 
 5   amt                    1296675 non-null  float64
 6   first                  1296675 non-null  object 
 7   last                   1296675 non-null  object 
 8   gender                 1296675 non-null  object 
 9   street                 1296675 non-null  object 
 10  city                   1296675 non-null  object 
 11  state                  1296675 non-null  object 
 12  zip                    1296675 non-null  int64  
 13  lat                    1296675 non-null  float64
 14  long                   1296675 non-null  float64
 15  city_pop               1296675 non-null  int64  
 16  job                    1296675 non-null  object 
 17  dob                    1296675 non-null  object 
 18  trans_num              1296675 non-null  object 
 19  unix_time              1296675 non-null  int64  
 20  merch_lat              1296675 non-null  float64
 21  merch_long             1296675 non-null  float64
 22  is_fraud               1296675 non-null  int64  
dtypes: float64(5), int64(6), object(12)
memory usage: 227.5+ MB
In [9]:
df_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 555719 entries, 0 to 555718
Data columns (total 23 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   Unnamed: 0             555719 non-null  int64  
 1   trans_date_trans_time  555719 non-null  object 
 2   cc_num                 555719 non-null  int64  
 3   merchant               555719 non-null  object 
 4   category               555719 non-null  object 
 5   amt                    555719 non-null  float64
 6   first                  555719 non-null  object 
 7   last                   555719 non-null  object 
 8   gender                 555719 non-null  object 
 9   street                 555719 non-null  object 
 10  city                   555719 non-null  object 
 11  state                  555719 non-null  object 
 12  zip                    555719 non-null  int64  
 13  lat                    555719 non-null  float64
 14  long                   555719 non-null  float64
 15  city_pop               555719 non-null  int64  
 16  job                    555719 non-null  object 
 17  dob                    555719 non-null  object 
 18  trans_num              555719 non-null  object 
 19  unix_time              555719 non-null  int64  
 20  merch_lat              555719 non-null  float64
 21  merch_long             555719 non-null  float64
 22  is_fraud               555719 non-null  int64  
dtypes: float64(5), int64(6), object(12)
memory usage: 97.5+ MB

First impression:

After inspecting the dataset, several initial insights can be drawn:

  • Both training and test sets share a consistent schema - 23 columns, identical names, and matching data types

  • There are no missing values across any columns, which simplifies preprocessing and ensures data integrity

  • The dataset combines spatial, temporal and demographic information, including customer and merchant coordinates, timestamps, job titles and transaction amounts - an excellent basis for behavioral and anomaly-based fraud detection

  • Several columns such as merchant, city, job, and category are categorical (object dtype) and will require proper encoding to prevent overfitting and handle high cardinality.

  • The is_fraud column is included in both datasets, providing clear binary labels for supervised learning.

Overall, the dataset is well-structured, comprehensive, and realistic, which provides a strong foundation for the modeling phase.


Dataset Integrity and Scale Verification

Before moving further into feature-level exploration, we will validate the internal scale and realism of the dataset to ensure consistency with the reported simulation design:

Full Dataset:

In [10]:
df_full = pd.concat([df_train, df_test], axis=0)
unique_counts = df_full.nunique()
print(unique_counts)
Unnamed: 0               1296675
trans_date_trans_time    1819551
cc_num                       999
merchant                     693
category                      14
amt                        60616
first                        355
last                         486
gender                         2
street                       999
city                         906
state                         51
zip                          985
lat                          983
long                         983
city_pop                     891
job                          497
dob                          984
trans_num                1852394
unix_time                1819583
merch_lat                1754157
merch_long               1809753
is_fraud                       2
dtype: int64

Train set:

In [11]:
print(f"Total transations (train): {len(df_train):,}")
print(f"Unique cardholders (cc_num): {df_train['cc_num'].nunique():,}")
print(f"Unique merchants: {df_train['merchant'].nunique():,}")
print(f"Unique categories: {df_train['category'].nunique():,}")
print(f"Transaction date range: {df_train['trans_date_trans_time'].min()} -> {df_train['trans_date_trans_time'].max()}")
Total transations (train): 1,296,675
Unique cardholders (cc_num): 983
Unique merchants: 693
Unique categories: 14
Transaction date range: 2019-01-01 00:00:18 -> 2020-06-21 12:13:37
In [12]:
print(f"Merchants: {df_train['merchant'].head(5)}")
print(f"Categories: {df_train['category'].unique()}")
Merchants: 0            fraud_Rippin, Kub and Mann
1       fraud_Heller, Gutmann and Zieme
2                  fraud_Lind-Buckridge
3    fraud_Kutch, Hermiston and Farrell
4                   fraud_Keeling-Crist
Name: merchant, dtype: object
Categories: ['misc_net' 'grocery_pos' 'entertainment' 'gas_transport' 'misc_pos'
 'grocery_net' 'shopping_net' 'shopping_pos' 'food_dining' 'personal_care'
 'health_fitness' 'travel' 'kids_pets' 'home']

Test set:

In [13]:
print(f"Total transations (test): {len(df_test):,}")
print(f"Unique cardholders (cc_num): {df_test['cc_num'].nunique():,}")
print(f"Unique merchants: {df_test['merchant'].nunique():,}")
print(f"Unique categories: {df_test['category'].nunique():,}")
print(f"Transaction date range: {df_test['trans_date_trans_time'].min()} -> {df_test['trans_date_trans_time'].max()}")
Total transations (test): 555,719
Unique cardholders (cc_num): 924
Unique merchants: 693
Unique categories: 14
Transaction date range: 2020-06-21 12:14:25 -> 2020-12-31 23:59:34
In [14]:
print(f"Merchants: {df_test['merchant'].head(5)}")
print(f"Categories: {df_test['category'].unique()}")
Merchants: 0                   fraud_Kirlin and Sons
1                    fraud_Sporer-Keebler
2    fraud_Swaniawski, Nitzsche and Welch
3                       fraud_Haley Group
4                   fraud_Johnston-Casper
Name: merchant, dtype: object
Categories: ['personal_care' 'health_fitness' 'misc_pos' 'travel' 'kids_pets'
 'shopping_pos' 'food_dining' 'home' 'entertainment' 'shopping_net'
 'misc_net' 'grocery_pos' 'gas_transport' 'grocery_net']

Explanation:

As was mentioned earlier, the dataset contains over 1.85 million transactions, seperated into 2 sets - the training set, with approximately 1.3 million transactions (≈70% of the total data) and the test set, with 555k transactions (≈30% of the whole data).

Together, they represent activity generated across 999 unique credit cards and 693 merchants, aligning closely with the reported simulation parameters of roughly 1,000 customers and 800 merchants (mentioned in Kaggle). These transactions are distributed among 14 merchant categories, covering everyday spending areas such as grocery, fuel, dining, shopping and travel - providing a balanced and realistic view of consumer behavior.

The temporal coverage extends from January 2019 to December 2020, with the training data spanning January 2019 to June 2020, and the test data continuing from June 2020 to December 2020. This continuous timeline captures approximately 2 years of transactional activity, sufficient to observe seasonal effects, behavioral variations and potential drift over time.

These findings confirm that the dataset's internal structure and temporal design are coherent, consistent and credible, faithfully representing the intended simulation logic rather than arbitrary synthetic data. While its overall size is modest compared to real-world credit card systems (which handle thousands of transactions per second), it is behaviorally representative, making it highly suitable for developing and validating fraud-detection models focused on transaction-level behavioral patterns rather than large-scale throughput.

Feature Cleanup and Selection

Before diving into deeper exploration, we perform some initial feature cleanup to reduce redundancy and remove non-informative columns

Columns to drop

  1. Unnamed:0 - Index column generated during CSV export (not useful for modeling)

  2. first and last - Names are non-predictive and irrelevant to fraud behavior

  3. trans_num - Unique transaction identifier; does not contribute to predictive patterns

In [15]:
drop = ['Unnamed: 0', 'first', 'last', 'trans_num']
for i in drop:
  df_train.drop(columns=i, inplace=True)
  df_test.drop(columns=i, inplace=True)

Handling Redundant Temporal Columns

Both trans_date_trans_time and unix_time encode the transaction timestamp. Since they represent the same information, we can safely drop one to avoid redundancy. We retain trans_date_trans_time because it offers direct interpretability and allows for the extraction of meaningful temporal features such as:

  • Hour of the day (to capture time-of-day spending patterns)

  • Day of the week (to identify weekday vs. weekend behaviors)

  • Month and year (for seasonal and long-term trend analysis)

In contrast, unix_time represents the same information as a continuous integer timestamp (seconds since the Unix epoch), which lacks immediate interpretability and cannot directly provide calendar-based insights without conversion.

In [16]:
df_train[['unix_time', 'trans_date_trans_time']].head(5)
Out[16]:
unix_time trans_date_trans_time
0 1325376018 2019-01-01 00:00:18
1 1325376044 2019-01-01 00:00:44
2 1325376051 2019-01-01 00:00:51
3 1325376076 2019-01-01 00:01:16
4 1325376186 2019-01-01 00:03:06
In [17]:
df_train = df_train.drop(columns=['unix_time'], errors='ignore')
df_test = df_test.drop(columns=['unix_time'], errors='ignore')

Duplicate Check

In [18]:
print(f"Number of duplicates rows (training set): {df_train.duplicated().sum()}")
print(f"Number of duplicates rows (test set): {df_test.duplicated().sum()}")
Number of duplicates rows (training set): 0
Number of duplicates rows (test set): 0

Overview Summary

  • Schema Match: Train and test sets align perfectly

  • Labels: The is_fraud label is binary and complete

  • Data Quality: No missing values; data types are consistent

  • Volume split: ~70% train / 30% test - a standard and reasonable proportion.

Now that the dataset's integrity and structure have been verified, we can confidently proceed to exploratory visualization and feature-level analysis, where each variable will be examined both numerically and visually to uncover patterns, outliers, and potential predictive signals indicative of fraudulent behavior

Features¶

🕘 trans_date_trans_time¶

Data Integrity¶

The trans_date_trans_time feature shows complete and consistent values with no missing or anomalous entries.

Duplicate timestamps are expected since multiple transactions can occur at the same moment, and all recorded timestamps fall within the valid range of January 2019 - December 2020.

To facilitate time-based analysis, we derived the following temporal features from trans_date_trans_time feature:

In [19]:
# Apply datetime conversion and feature extraction for both datasets
for df in [df_train, df_test]:
  df['trans_date_trans_time'] = pd.to_datetime(df['trans_date_trans_time'])
  df['hour'] = df['trans_date_trans_time'].dt.hour
  df['day_of_week'] = df['trans_date_trans_time'].dt.day_name()
  df['month'] = df['trans_date_trans_time'].dt.month
  df['year'] = df['trans_date_trans_time'].dt.year
  df['date'] = df['trans_date_trans_time'].dt.date
In [20]:
for name, df in [('df_train', df_train), ('df_test', df_test)]:
  print(f"\nNumber of unique values in temporal features of {name}:")
  for col in ['hour', 'day_of_week', 'month', 'year']:
    print(f"{col} unique values: {df[col].nunique()}")
Number of unique values in temporal features of df_train:
hour unique values: 24
day_of_week unique values: 7
month unique values: 12
year unique values: 2

Number of unique values in temporal features of df_test:
hour unique values: 24
day_of_week unique values: 7
month unique values: 7
year unique values: 1

These results confirm that all temporal components were correctly extracted. We also examined df_test, which is something we normally avoid to prevent data leakage. However, in this case, the inspection focused solely on basic structural integrity, not on any patterns or distributions related to the target variable (is_fraud).

This validation step was necessary, because we added engineered temporal features to df_test, ensuring they are properly structured for models evaluation, and remain consistent with the feature set used to train the models on df_train. These features will replace the direct reliance on the original trans_date_trans_time column.

In [21]:
df_train = df_train.drop(columns=['trans_date_trans_time'], errors='ignore')
df_test = df_test.drop(columns=['trans_date_trans_time'], errors='ignore')

Overall, the trans_date_trans_time feature and its derived temporal components demonstrate excellent data integrity. We are now ready to explore the derived features and analyze interesting patterns in the data:

Transaction Volume over Time¶
In [22]:
daily_txn = df_train.groupby('date').size()

plt.figure(figsize=(12,6))
daily_txn.plot(kind='line', lw=1.5)
plt.title("Daily Transaction Volume Over Time")
plt.xlabel("Date")
plt.ylabel("Number of Transactions")
plt.grid(True, alpha=0.3)
plt.show()
No description has been provided for this image

Graph 1 - Daily Transaction Volume Over Time

The figure above illustrates the daily transaction counts between January 2019 and June 2020.

A consistent weekly cyclic pattern is visble, reflecting regular consumer activity. Transaction volumes rise gradually through 2019, stabilizing around 3,500 - 4,000 transactions per day, before spiking sharply to ≈ 6,000 transactions/day in late 2019. At the start of 2020, the volume drops to around 2,500 - 3,000 transactions/day, where it remains steady throughout the following months.

These structural shifts could be due to changes in simulated data generation, seasonal shopping cycles, or economic variations represented in the synthetic dataset. The strong periodic peaks and troughs likely correspond to weekly purchasing rhythms, which will be explored further in the weekday and hourly analyses that follow

💡 Interpretation Note

Interestingly, the sharp decline in early 2020 coincides with the real-world emergence of COVID-19, which may have been implicitly reflected in the simulator's generation logic. Even if unintentional, the pattern aligns with actual global spending slowdowns, providing a plausible, interpretable shift within the dataset's temporal structure.

This correspondence may help explain the reason for the sharp decline in transaction volume observed at the beginning of 2020

Transactions per Hour¶
In [23]:
hourly_txn = (
    df_train.groupby('hour')
    .size()
    .reset_index(name="count")
)

sns.barplot(data=hourly_txn, x='hour', y='count')
plt.title("Transactions per Hour of Day")
plt.xlabel("Hour of Day (0-23)")
plt.ylabel("Number of transactions")
plt.show()
No description has been provided for this image

Graph 2 - Transactions per Hour of Day

The distribution of transactions across hours of the day reveals two distinct activity zones:

  1. Hours 0-11 (midnight to late morning): transaction volumes remain relatively stable at around ~42K transactions per hour

  2. Hours 12-23 (afternoon to midnight): A pronounced surge occurs, with volumes increasing sharply to around ~65K transactions per hour

This pattern indicates that most transactional activity occurs in the second half of the day, after noon.

The rise after 12:00 corresponds to typical consumer behavior, with increased purchasing during lunch breaks, afternoon shopping, and evening leisure or online spending.

For fraud detection, the hour-of-day feature is likely highly informative. Transaction density is far from uniform throughout the day, meaning unusual timing (e.g., very late night activity) may signal suspicious behavior.

💡 Interpretation Note

Interestingly, the sharp transition around midday could also stem from batch-based data generation or transaction posting delays, which are common in financial systems that process transactions in grouped cycles.

In a real-world context, this midday spike could reflect the combined effect of time zone overlaps (e.g., East Coast and West Coast transaction synchronization) or increased digital activity as users engage more with online platforms after work hours.

These hourly variations may later help us design more advanced temporal features (like is_night or is_weekend) that encode typical behavioral rhythms into our model

Transactions per Weekday¶
In [24]:
# Weekday analysis
order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

avg_by_day = (
    df_train.groupby('day_of_week')
    .size()
    .groupby(level=0)
    .mean()
    .reindex(order)
)

sns.barplot(
    data=avg_by_day.reset_index(),
    x='day_of_week',
    y=0,
    order=order
)

plt.title("Average Transactions per Weekday")
plt.ylabel("Average Daily Transactions")
plt.xlabel("")
plt.show()
No description has been provided for this image

Graph 3 - Average Transactions per Weekday

The bar chart reveals a clear weekly seasonality in transaction activity. This analysis is based on the average number of daily transactions, since weekdays occur unevenly throughout the dataset (some months contain more Mondays or Fridays than others). Transaction volumes peak on Mondays and Sundays, suggesting higher consumer spending at the start and end of each week

Mid-week days (Wednesday through Friday) show a noticeable decline in activity, while Saturday sits in an intermediate range - higher than most weekdays but below the two peak days.

This pattern reflects typical consumer behavior: spending often increases during weekends and early in the week when individuals complete online purchases or handle routine payments after the weekend.

Transactions per Month¶
In [25]:
monthly_txn = (
    df_train.groupby('month')
    .size()
    .reset_index(name="count")
)

sns.barplot(data=monthly_txn, x='month', y='count')
plt.title("Transactions per Month")
plt.xlabel("Months (1-12)")
plt.ylabel("Number of transactions")
plt.show()
No description has been provided for this image

Graph 4 - Transactions per Month

The bar chart illustrates the distribution of transactions across the twelve months of the year. A clear seasonal trend is visible, showing fluctuations in consumer activity throughout the year.

Transaction volumes gradually increase from January, reaching their highest levels during April to June, with May standing out as the peak month of activity. Following this mid-year high, there is a noticeable decline from July to October, indicating a period of reduced consumer spending.

Toward the end of the year, transaction counts rise again in December, reflecting a holiday-related surge in purchases, a common seasonal pattern in financial transaction data.

This monthly trend suggests that the dataset captures realistic seasonal consumer behavior, where higher transaction volumes correspond to known spending periods, such as spring and end-of-year shopping cycles.

Fraudulent transactions¶

Let us now analyze the rate of fraudulent transactions, based on the months, days and hours:

In [26]:
fig, axes = plt.subplots(3, 1, figsize=(12, 18))

# --- Graph 1: Fraud count by month ---
fraud_by_month = df_train.groupby('month')['is_fraud'].sum()
axes[0].plot(fraud_by_month.index, fraud_by_month.values, marker='o')
axes[0].set_title("Fraud Count by Month")
axes[0].set_ylabel("Fraud Count")
axes[0].set_xlabel("")

# --- Graph 2: Fraud count by weekday ---
order = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']
fraud_by_day = df_train.groupby('day_of_week')['is_fraud'].sum().reindex(order)
sns.barplot(x=fraud_by_day.index, y=fraud_by_day.values, ax=axes[1], order=order)
axes[1].set_title("Fraud Count by Weekday")
axes[1].set_ylabel("Fraud Count")
axes[1].set_xlabel("")

# --- Graph 3: Fraud count by hour ---
fraud_by_hour = df_train.groupby('hour')['is_fraud'].sum()
sns.barplot(x=fraud_by_hour.index, y=fraud_by_hour.values, ax=axes[2])
axes[2].set_title("Fraud Count by Hour of Day")
axes[2].set_ylabel("Fraud Count")
axes[2].set_xlabel("Hour (0–23)")

plt.tight_layout()
plt.show()
No description has been provided for this image

Graph 5 - Fraudulent transactions by Month, Weekday and Hour

The temporal distribution of fraudulent transactions reveals several behavioral patterns that align closely with normal consumer activity - a likely attempt by fraudsters to blend in and avoid detection

  • Monthly: Fraud rates vary noticeably throughout the year. Activity peaks around March - May, then drops sharply through July-October, before slightly rising again in December. This coincides with the highest overall tranasction volumes in Graph 4, which suggests that fraudsters may deliberately exploit periods of intense consumer activity, when their actions can be more easily concealed among a large number of legitimate transactions.

  • Weekly: Fraud occurs across all days of the week but is most frequent on weekends (Saturday-Sunday) and Mondays. This mirrors the pattern of regular transaction activity, reinforcing the idea that attackers intentionally target high-traffic periods when monitoring may be less strict or slower to respond.

  • Hourly: Fraud is concentrated late at night, especially between 21:00 and 03:00, far outside normal consumer behavior peaks.

These patterns indicate that fraudsters strategically time their actions to mimic legitimate behavior - taking advantange of busy transaction periods and low-monitoring hours. Consequently, time-based features such as hour, day_of_week and month provide strong predictive signals for fraud detection models, helping to distinguish genuine consumer activity from subtle fraudulent behavior.

Overall, the trans_date_trans_time derived useful temporal components that demonstrate excellent data integrity and reveal clear and realistic behavioral patterns in transaction activity. These patterns, both for legitimate and fraudulent transactions, confirm that time-based behavior plays a key role in distinguishing between normal and suspicious activity.

Having established clear temporal behavior patterns, we now proceed to analyze card-level activity (cc_num) to explore how transaction frequency and fraud concentration vary across cardholders


💳 cc_num¶

Data Integrity¶

In the previous sections, we confirmed that the training set contains no missing values, including the cc_num feature, and identified 983 unique cardholders - consistent with the dataset's design.

At this point, we verify whether there are any unrealistic duplicate transactions, where the same card number, timestamps, and transaction amount appear together - a potential indicator of synthetic duplication or data leakage

In [27]:
dup_card_rows = df_train.duplicated(subset=['cc_num', 'year', 'month', 'day_of_week', 'hour', 'amt']).sum()
print(f"Duplicate transactions (same card, time, and amount): {dup_card_rows}")
Duplicate transactions (same card, time, and amount): 93

We can see that there are 93 identified duplicate records. While it is not a large amount of records (≈0.005% of all transactions), it is worth checking whether these transactions are anomalious or standard, realistic transactions:

In [28]:
dup_cards = (
    df_train[df_train.duplicated(subset=['cc_num', 'year', 'month', 'day_of_week', 'hour', 'amt'], keep=False)]
    .groupby('cc_num')
    .size()
    .reset_index(name="duplicate_count")
    .sort_values('duplicate_count', ascending=False)
)

dup_cards.head(10)
Out[28]:
cc_num duplicate_count
1 571365235126 4
5 4464457352619 4
6 4585132874641 4
8 30270432095985 4
11 30561214688470 4
40 3531129874770000 4
79 4536996888716062123 4
82 4956828990005111019 4
45 3553629419254918 4
23 341546199006537 4

We observe that the top 10 credit cards each exhibit four repeated transaction patterns. What stands out is that this duplication is highly structured rather than random - every card repeats exactly four times, a level of uniformity that seems too precise to occur by chance. In a noisy or corrupted dataset, we would expect varying repetition counts across cards. Therefore, this pattern likely reflects an intentional design or simulation effect rather than accidental duplication.

One possible explanation could be recurring legitimate payments, where cardholders repeatedly pay the same bill or subscription under similar conditions. However, this seems improbable, because the repetition is too consistent and limited in scope. If these were genuine recurring payments, we would expect a wider distribution of repetition frequencies and more extensive recurrence over time.

Let us observe what is the fraud rate among these duplicates:

In [29]:
df_train[df_train['cc_num'].isin(dup_cards['cc_num'])]['is_fraud'].value_counts(normalize=True)
Out[29]:
proportion
is_fraud
0 0.9962
1 0.0038

The fraud rate among the duplicated transactions is extremely low, indicating that these repetitions are almost entirely non-fraudulent. This suggests that the duplicate patterns are not the result of malicious activity, but rather a byproduct of the simulation process or recurring legitimate-like behavior within the synthetic data.

In other words, while the duplication pattern is unusually structured, it does not correspond to elevated fraud risk and therefore does not compromise data integrity. Instead, it provides a minor but realistic layer of transaction redundancy, consistent with real-world payment systems where repeated or batched transactions occasionally occur. Therefore, we won't drop or modify the duplicates, and leave them as they are

Next, it is critical to check whether there is train-test overlap, which can lead to data leakage.

If the same credit card appears in both the training and test sets, a model might memorize a card's historical behavior instead of learning generalizable fraud patterns

In [30]:
train_cards = set(df_train['cc_num'].unique())
test_cards = set(df_test['cc_num'].unique())

overlap = len(train_cards & test_cards)
print(f"Cards appearing in both train and test: {overlap} / {len(test_cards)} ({overlap/len(test_cards):.2%})")
Cards appearing in both train and test: 908 / 924 (98.27%)

We observe that 908 out of 924 cards (≈ 98.3%) appear in both the training and test datasets.

This indicates that the dataset was split by transaction rather than by cardholder - meaning the same credit card can appear in both sets

While this design is valid for transactional modeling, it introduces a potential information leakage risk. A model might memorize individual card behavior instead of learning general fraud patterns.

To mitigate this, during modeling we should consider:

  • Using grouped cross-validation by cc_num, ensuring that all transactions of a given card remain in the same fold

  • Evaluating the model both with and without card-level features to the generalization capability

💡 Important Remark:

Although the overlap raises leakage concerns, it also opens the door for meaningful feature engineering. In real-world banking systems, institutions track and share historical information about cardholders. Inspired by this, we could later engineer a feature such as history_of_fraud - indicating whether a card has previously been involved in fraudulent activity.

Such a feature would emulate real fraud prevention mechanisms, where past behavior informs current risk, allowing the model to better identify high-risk cards while maintaining realistic, and ethical modeling practices.

Transaction Frequency per Card¶

Although card numbers serve primarily as identifiers, their activity levels can reveal behavioral patterns.

Here, we examine how many transactions each card performs:

In [31]:
df_train['cc_num'].value_counts().describe()
Out[31]:
count
count 983.000000
mean 1319.099695
std 812.235900
min 7.000000
25% 525.000000
50% 1054.000000
75% 2025.000000
max 3123.000000

The transaction distribution per card is highly uneven:

  • Minimum: 7 transactions
  • Maximum: 3,123 transactions
  • Median: 1,054 transactions
  • Mean: ~1319
  • Standard deviation: 812

This indicates that a small subset of high-activity cards contributes disproportionately to the total transaction volume - a common phenomenon in real financial datasets where some customers transact more frequently (e.g., business accounts, recurring payments).

For modeling, this implies that cc_num may introduce bias or overfitting if the model memorizes specific card patterns rather than learning generalized fraud behaviors

Outlier Exploration¶

Since cc_num is categorical (an ID), it can't have numeric outliers - but its behavioral characteristics can.

We therefore define "outliers" as cards that display:

  • Unusually high or low transactions counts (activity outliers)

  • Atypical fraud ratios compared to the general population

  • Abnormal mean transaction amounts

We identify behavioral outliers using the IQR method applied to transaction counts per card:

In [32]:
txn_per_card = df_train.groupby('cc_num')['cc_num'].count().rename('txn_count')

Q1 = txn_per_card.quantile(0.25)
Q3 = txn_per_card.quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outlier_cards = txn_per_card[(txn_per_card < lower_bound) | (txn_per_card > upper_bound)]
print(f"Number of outlier cards: {len(outlier_cards)}")
Number of outlier cards: 0
In [33]:
plt.figure(figsize=(10,5))
sns.histplot(txn_per_card, bins=50, kde=True)
plt.axvline(upper_bound, color='red', linestyle='--', label='Upper Outlier Threshold')
plt.axvline(lower_bound, color='red', linestyle='--')
plt.title("Distribution of Transactions per Card - Identifying Outliers")
plt.xlabel("Number of Transactions per Card")
plt.ylabel("Count of Cards")
plt.legend()
plt.show()
No description has been provided for this image

Graph 6 - Distribution of Transactions per Card (Outlier Detection)

The histogram above displays the number of transactions made by each credit card (cc_num). The red dashed lines represent the calculated lower and upper statistical thresholds (1.5 x IQR rule).

The distribution follows a multi-modal and right-skewed pattern, showing distinct clusters of cards around specific transaction ranges (roughly 500, 1000, 1500, 2000 and 3000). This clustering is typical for simulated transactional datasets, where user behavior is generated across several predefined activity levels - for example, light, moderate, and heavy spenders.

Notably, no cards fall beyond the upper outlier threshold, meaning all cardholders exhibit realistic transaction volumes. Even the most active cards (≈3000 transactions) remain within expected behavioral limits.

The absence of statistical outliers confirms that the cc_num feature demonstrates consistent and credible transaction patterns, without evidence of synthetic bias or extreme anomalies.

The next section further examines how fraudulent activity is distributed across these cards, helping us understand whether certain cardholders are more exposed than others

Fraud Distribution and Concentration¶

Let us assess whether fraudulent activity is concentrated within a few cards or spread broadly across all cardholders:

In [34]:
fraud_counts_per_card = df_train.groupby('cc_num')['is_fraud'].sum()

plt.figure(figsize=(10,5))
sns.histplot(fraud_counts_per_card, bins=50, log=True)
plt.title("Number of cards per count of fraudulent transactions")
plt.xlabel("Number of fraudulent transactions")
plt.ylabel("Number of cards")
plt.show()
No description has been provided for this image

Graph 7 - Number of Cards per count of fraudulent transactions

The histogram shows that most cards experience few or no fraudulent transactions, typically between 5-15 per card, while only a handful of cards show significantly higher fraud counts.

This pattern suggests that fraud is widespread but low-intensity, resembling random attack behavior rather than repeated targeting of specific accounts

Fraud Exposure Rate

Finally, we check how many cards experienced at least one fraudulent transaction

In [35]:
fraud_cards = df_train[df_train['is_fraud'] == 1]['cc_num'].nunique()
total_cards = df_train['cc_num'].nunique()
print(f"{fraud_cards}/{total_cards} cards (~{fraud_cards/total_cards:.2%}) had at least one fraud")
762/983 cards (~77.52%) had at least one fraud

💡 Interpretation:

Around 77.5% of all cards (762 out of 983) experienced at least one fraud event. This confirms that fraud is not limited to a small subset of users, but rather distributed across the dataset - consistent with the simulator's goal to represent broad, population-level fraud exposure.

As a result, transaction-level behavioral features (such as time, amount, and merchant category) will be more effective than card identifiers themselves for detecting fraud patterns

The cc_num feature is clean, internally consistent and behaviorally informative. It forms a solid basis for aggregation-based features (e.g., fraud rate per card, average transaction frequency), though care must be taken to prevent overfitting in models that might memorize specific card behaviors.

Having established that, we now move to the merchant feature, which represents the point of transaction.


🛒 merchant¶

Data Integrity¶

The merchant feature represents the vendor or business where each transaction occurred. Earlier training set checks confirmed that this feature has 693 unique values and no missing entries, ensuring completeness.

However, completeness alone is insufficient - we must verify that the data is authentic, semantically consistent, and free from artificial duplication or irregularities.

Upon closer inspection, every merchant name follows a structured and realistic pattern, such as:

In [36]:
df_train['merchant'].sample(10)
Out[36]:
merchant
469158 fraud_Cummings LLC
886353 fraud_Bernhard Inc
715794 fraud_Murray-Smitham
1045963 fraud_Haag-Blanda
1029663 fraud_Gutmann, McLaughlin and Wiza
571553 fraud_Kling Inc
581164 fraud_Berge-Hills
1035728 fraud_Boyer PLC
749136 fraud_Sporer-Keebler
530846 fraud_Witting, Beer and Ernser

All merchant names begin with the "fraud_" prefix, followed by a synthetic business name composed of one or two surnames and an optional corporate suffix (Inc, LLC, Ltd, PLC, Group, and Sons, etc.).

To confirm the origin of this structure, we referenced the dataset's official description on Kaggle, which explicitly explain the generation process:

"The simulator has certain pre-defined lists of merchants, customers, and transaction categories. Using the Python library 'faker', and the number of customers and merchants you specify, an intermediate list is created. Transactions are then simulated according to behavioral profiles (e.g., 'adult females 25-50 rural') with defined transaction frequencies and amount distributions."

-Kartik2112, Kaggle Dataset: Credit Card Fraud Detection (Sparkov Simulator)

This confirms that merchant names were generated using the faker library, ensuring structural realism while remaining fully synthetic. Each name behaves like a legitimate business identifier, even though it was programmatically generated.

A quick check confirms that 100% of merchant entries begin with the "fraud_" prefix:

In [37]:
prefix_check = df_train['merchant'].str.startswith('fraud_').mean()
print(f"Percentage of merchants starting with 'fraud_: {prefix_check:.2%}")
Percentage of merchants starting with 'fraud_: 100.00%

Thus, the merchant feature exhibits uniform structure, synthetic consistency, and no semantic leakage of fraud-related meaning from its textual content.

To verify that merchant names do not implicitly encode fraud related information, we examine whether naming patterns (e.g., suffixes) correlate with fraud likelihood

In [38]:
global_fraud_rate = df_train['is_fraud'].mean()
print(f"Global fraud rate: {global_fraud_rate:.2%}")

df_train['suffix'] = df_train['merchant'].str.extract(r'(LLC|Group|Inc|and Sons|Ltd|PLC)', expand=False)
fraud_by_suffix = df_train.groupby('suffix')['is_fraud'].mean().sort_values(ascending=False)
print(fraud_by_suffix)
Global fraud rate: 0.58%
suffix
Inc         0.007364
PLC         0.006649
and Sons    0.005264
Ltd         0.005236
LLC         0.005103
Group       0.004678
Name: is_fraud, dtype: float64

The global fraud rate in the dataset is 0.58%, and the fraud ratios across all major merchant suffixes remain tightly clustered around this baseline - from 0.46% to 0.73%.

This minimal deviation (≈ ± 0.0015 in absolute terms) indicates that these fluctuations are statistically negligible and fall well within the range of normal sampling variation.

Therefore, no suffix category exhibits a disproportionately high fraud rate, confirming that merchant names do not encode or correlate with fraudulent behavior.

Merchant Distributions and Outliers¶

Next, we examine merchant-level transaction and fraud distributions to identify potential outliers or irregular concentration of fraud:

In [39]:
txn_per_merchant = df_train['merchant'].value_counts()
fraud_per_merchant = df_train.groupby('merchant')['is_fraud'].sum().sort_values(ascending=False)

display(txn_per_merchant.head(10)) # Top 10 merchants by transaction volume
display(fraud_per_merchant.head(10)) # Top 10 merchants by total fraud
count
merchant
fraud_Kilback LLC 4403
fraud_Cormier LLC 3649
fraud_Schumm PLC 3634
fraud_Kuhn LLC 3510
fraud_Boyer PLC 3493
fraud_Dickinson Ltd 3434
fraud_Cummerata-Jones 2736
fraud_Kutch LLC 2734
fraud_Olson, Becker and Koch 2723
fraud_Stroman, Hudson and Erdman 2721

is_fraud
merchant
fraud_Rau and Sons 49
fraud_Cormier LLC 48
fraud_Kozey-Boehm 48
fraud_Kilback LLC 47
fraud_Doyle Ltd 47
fraud_Vandervort-Funk 47
fraud_Kuhn LLC 44
fraud_Padberg-Welch 44
fraud_Terry-Huel 43
fraud_Jast Ltd 42

The top merchants by volume process between 3,000 - 4,400 transaction each, consistent with high-traffic businesses. Similarly, the top merchants by fraud counts correspond to these same high-volume entities, indicating that fraud frequency scales with activity, not with merchant identity.

In [40]:
top_merchants = pd.DataFrame({
    'Transactions': txn_per_merchant.head(10),
    'Fraud_Counts': fraud_per_merchant.head(10)
}).fillna(0)

top_merchants.plot(kind='bar', figsize=(10,5))
plt.title('Top 10 Merchants by Transactions vs. Fraud Counts')
plt.ylabel('Count')
plt.xlabel('Merchant')
plt.xticks(rotation=45, ha='right')
plt.show()
No description has been provided for this image

Graph 8 - Top Merchants by Transactions vs. Fraud Counts

The chart illustrates total transaction volume (blue) and total fraud count (orange) for the 10 most active merchants.

While the most active merchants naturally exhibit more fraud events, the ratio of fraud-to-total transactions remains stable across all entities. This demonstrates that fraud is proportionally distributed across the network rather than concentrated in specific merchants.

To ensure statistical validity, we'll apply an IQR-based outlier check on merchant transaction volumes:

In [41]:
Q1 = txn_per_merchant.quantile(0.25)
Q3 = txn_per_merchant.quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outlier_merchants = txn_per_merchant[
    (txn_per_merchant < lower_bound) | (txn_per_merchant > upper_bound)
]
print(f"Number of outlier merchants: {len(outlier_merchants)}")
Number of outlier merchants: 5
In [42]:
plt.figure(figsize=(10,5))
sns.histplot(txn_per_merchant, bins=50, kde=True)
plt.axvline(upper_bound, color='red', linestyle='--', label='Upper Outlier Threshold')
plt.axvline(lower_bound, color='red', linestyle='--')
plt.title("Distribution of Transactions per Merchant - Identifying Outliers")
plt.xlabel("Number of Transactions per Merchant")
plt.ylabel("Count of Merchants")
plt.legend()
plt.show()
No description has been provided for this image

Graph 9 - Distribution of Transactions per Merchant (Outlier Detection)

The histogram above illustrates the distribution of transaction volumes per merchant, with the red dashed lines marking the lower and upper thresholds based on the 1.5 x IQR rule

The distribution is right-skewed and multimodal, suggesting several distinct merchant activity tiers - likely representing different merchant types such as small, medium, and high-volume vendors.

The analysis identified 5 outlier merchants exceeding the upper bound of normal activity (~3,500 transactions). These merchants exhibit exceptionally high transaction volumes compared to the rest of the population.

However, upon cross-checking with their respective fraud rates, these outliers do not display abnormal or inflated fraud ratios. Their elevated transaction counts are therefore attributed to legitimate high-volume business behavior, not data corruption or synthetic bias.

This pattern mirrors realistic market dynamics, where a small number of large retailers process a disproportionately high share of transaction, a natural "power-law" effect observed in real-world financial ecosystems.

Overall, the merchant feature is clean, structurally valid, and behaviorally consistent. Its values are uniformly generated, semantically neutral, and show realistic diversity in transaction frequency. The absence of abnormal fraud concentrations or naming irregularities confirms that merchants behave as reliable categorical identifiers.

For modeling standpoint, this feature may be best leveraged through aggregated or statistical representation - such as per-merchant fraud rate, mean transaction amount, or temporal activity frequency - rather than as a raw categorical label. This approach is motivated by the fact that there are hundreds of distinct merchants, making one-hot encoding inefficient and prone to sparsity. More so, fraud signals appear to stem from behavioral dynamics (such as spending frequency or transaction timing) rather than merchant identity itself.

With this validation complete, we now move on to explore the next feature - category, which describes the type of product or service purchased and may reveal further behavioral distinctions between legitimate and fraudulent transactions.

📚 category¶

The category feature specifies the type of merchant or industry associated with each transaction - e.g. "gas_transport", "grocery_pos", "shopping_net", "home", etc. It represents where and how money is spent, which makes it a behavioral and risk sensitive dimension in fraud analysis.

Based on previous analysis, there are 14 unique categories, and no missing or invalid entries. This compact yet complete categorical structure suggests excellent data consistency and semantic validity - each value corresponds to a well-defined merhant type rather than arbitrary labels.

Category Distribution¶

To evaluate category balance and detect potential dominanace or underrepresentation, we review transaction counts per category:

In [43]:
txn_per_cat = df_train['category'].value_counts()
txn_per_cat
Out[43]:
count
category
gas_transport 131659
grocery_pos 123638
home 123115
shopping_pos 116672
kids_pets 113035
shopping_net 97543
entertainment 94014
food_dining 91461
personal_care 90758
health_fitness 85879
misc_pos 79655
misc_net 63287
grocery_net 45452
travel 40507

In [44]:
plt.figure(figsize=(10,6))
sns.barplot(
    y=df_train['category'].value_counts().index,
    x=df_train['category'].value_counts().values,
    palette="Blues_r"
)
plt.title("Distribution of Transactions by Category", fontsize=14)
plt.xlabel("Number of Transactions", fontsize=12)
plt.ylabel("Category", fontsize=12)
plt.grid(axis='x', linestyle='--', alpha=0.4)
plt.tight_layout()
plt.show()
/tmp/ipython-input-4151199195.py:2: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(
No description has been provided for this image

Graph 10 - Distribution of Transactions by Category

The chart above illustrates the relative transaction volume across all 14 merchant categories.

  • The dataset is dominated by gas_transport, grocery_pos and home transactions - each exceeding 120k records. These represent high-frequency, everyday purchases, characteristic of regular consumer spending

  • Mid-tier categories such as shopping_pos, kids_pets, and shopping_net maintain strong representation (≈90k-110k), reflecting diverse commercial activity across both physical and online channels

  • Lower-volume segments, including grocery_net and travel, still contain tens of thousands of transactions, ensuring that no category suffers from data sparsity

This balanced distribution indicates that the category feature is well-structured and statistically robust, with each class large enough to support meaningful fraud-rate comparisons.

Fraud Rate Analysis by Category¶

Next, we examine the total number of fraudulent transactions and fraud ratios per category:

In [45]:
fraud_per_cat = df_train.groupby('category')['is_fraud'].sum().sort_values(ascending=False)
fraud_rate_per_cat = df_train.groupby('category')['is_fraud'].mean().sort_values(ascending=False)

fig, ax1 = plt.subplots(figsize=(10,6))

# Blue bars for fraud counts
sns.barplot(
    x=fraud_per_cat.index,
    y=fraud_per_cat.values,
    color='steelblue',
    ax=ax1
)
ax1.set_ylabel("Fraudulent Transactions (Count)", color="steelblue")
ax1.tick_params(axis='x', rotation=45)

# Red line for fraud rate
ax2 = ax1.twinx()
sns.lineplot(
    x=fraud_rate_per_cat.index,
    y=fraud_rate_per_cat.values,
    color="red",
    marker="o",
    ax=ax2
)
ax2.set_ylabel("Fraud Rate", color="red")

plt.title("Fraud Distribution Across Merchant Categories", fontsize=14)
plt.tight_layout()
plt.show()
No description has been provided for this image

Graph 11 - Fraud Distribution Across Merchant Categories

The chart above compares the number of fraudulent transactions (blue bars) with fraud rate (red line) across all merchant categories.

  • The categories grocery_pos, shopping_net and misc_net dominate both in total fraud counts and relative fraud rates, marking them as the three most fraud-prone sectors in the dataset.

  • Notably, shopping_net and misc_net represent online or card-not-present channels, which are inherently more vulnerable to fraudulent activity due to weaker identity verification mechanisms.

  • The grocery_pos category - typically a physical point-of-sale (POS) channel, is showing similarly high fraud involvement, which suggests either card cloning or local misuse, both common in real world retail fraud.

  • In contrast, categories such as travel and health_fitness exhibit both low fraud counts and very low fraud rates, implying that fraudsters rarely target these sectors within the simulated environment

Overall, this dual-axis analysis highlights that fraud activity is not randomly distributed but rather clustered within specific commercial domains, primarily online retail and everyday POS categories. This pattern closely mirrors real-world fraud dynamics, where high-frequency, low-verification environments tend to attract the most fraudulent behavior

From the analyses above, the category feature demonstrates excellent data integrity, no structural anomalies and strong behavioral signal value. The fraud distribution across categories is both statistically meaningful and domain consistent, showing that fraudulent activity clusters around specific merchant types rather than occuring uniformly.

In particular, online and high-frequency sectors (shopping_net, misc_net, grocery_pos) emerge as consistently higher-risk environments, reflecting real-world vulnerabilities in card-not-present and everyday retail transactions. This insight provides direct value for model development - categorical embeddings or one-hot representations of category can help the model learn contextual risk patterns, meaning, to recognize that a $500 purchase in shopping_net might carry greater fraud likelihood than the same amount spent in travel or health_fitness.

Therefore, category is not only a clean and reliable feature, but also an informative behavioral predictor, one that signals where fraudulent behaviors are most likely to occur and should thus be explicitly incorporated into the model's feature design

💵 amt¶

The amt feature represents the monetary value of each transaction. As one of the most behaviorally revealing and predictive features in fraud detection, transaction amount provides critical insight into risk magnitude and spending intent.

From a behavioral perspective:

  • Legitimate customers tend to operate within consistent spending ranges that reflect their income and lifestyle

  • Fraudsters, on the other hand, face an optimization tradeoff: maximize profit while minimizing detection risk. This often results in two distinct fraudulent behaviors:

    1. High-value thefts, where large transactions are attempted for maximum gain

    2. Micro-transactions ("testing" behavior), where small amounts are used to probe card validity before escalating the fraud

Therefore, we expect the upper end of the transaction spectrum to show elevated fraud risk due to high-value exploitation attempts, while low-value transactions become suspicious only when they occur repeatedly or in clusters - for example, when a single card executes multiple small payments within a short time window, to the same merchant. This distinction reflects real-world anti-fraud and anti-structuring practices used in financial systems worldwide.

Data Integrity¶

Before interpreting behavioral patterns, it's essential to confirm that the amt feature represents valid and realistic monetary values

We begin by examining descriptive statistics and checking for impossible or inconsistent cases:

In [46]:
pd.set_option('display.float_format', '{:,.2f}'.format)
df_train['amt'].describe()
Out[46]:
amt
count 1,296,675.00
mean 70.35
std 160.32
min 1.00
25% 9.65
50% 47.52
75% 83.14
max 28,948.90

  • The training set contains 1.29 million transactions, all with positive monetary values, confirming that there are no invalid (negative or zero) entries.

  • The minimum amount of $1 and a maximum of about $28,949 fall within realistic bounds for everyday and high-value spending - no anomalies or simulation errors were detected

  • The mean of $70 and median of $47.5 indicate a right-skewed distribution, where most purchases are small or moderate, while a few high-value transactions stretch the upper tail.

  • The standard deviation of ≈$160 reinforces this skewness, showing wide variability consistent with real-world consumer spending behavior

These results confirm that amt is a clean, logically consistent and trustworthy feature. The values correspond well to genuine transaction magnitudes rather than simulation noise or data corruption.

Distribution of Fraudulent vs. Legitimate Amounts¶
In [47]:
plt.figure(figsize=(10,6))
sns.kdeplot(df_train[df_train['is_fraud'] == 0]['amt'], label='Legitimate', fill=True)
sns.kdeplot(df_train[df_train['is_fraud'] == 1]['amt'], label='Fraudulent', fill=True, color='red')
plt.xscale('log')
plt.title("Distribution of Transaction Amounts")
plt.xlabel("Transaction Amount ($, log scale)")
plt.ylabel("Density")
plt.legend()
plt.show()
No description has been provided for this image

Graph 12 - Distribution of Transactions Amounts

The KDE plot above compares the density of transaction amounts between legitimate and fraudulent cases on a logarithmic scale.

  • Legitimate transactions cluster heavily bellow ≈$200, with density peaking between $10-$100, consistent with everyday consumer spending

  • Fraudulent transactions show two prominent peaks around $300 - $1000, indicating a strong preference for mid to high value operations. Very few frauds occur at extremely low amounts, implying that large-value exploitation is the dominant strategy in this dataset.

Overall, this visualization confirms that frauds are not uniformly distributed across the monetary spectrum, they occur disproportionately at higher transaction values, which provides a strong predictive signal for machine learning models

Category-Amount interaction (Which sectors have expensive frauds?)¶

By combining amt and category together, we can detect where the largest fraudulent transactions occur:

In [48]:
fraud_amt_by_cat = (
    df_train[df_train['is_fraud'] == 1]
    .groupby('category')['amt']
    .mean()
    .sort_values(ascending=False)
)
plt.figure(figsize=(10,6))
fraud_amt_by_cat.plot(kind='bar', color='crimson')
plt.title("Average Fraudulent Transaction Amount by Category")
plt.ylabel("Average Amount ($)")
plt.xlabel("Category")
plt.xticks(rotation=45)
plt.show()
No description has been provided for this image

Graph 13 - Average Fraudulent Transaction Amount by Category

The chart displays the mean dollar amount of fraudulent transactions per merchant category.

  • Fraudulent purchases in shopping_net, shopping_pos and misc_net average $800 - $1000, indicating that fraudsters target high-value retail sectors where goods can be easily monetized.

  • Mid-range categories like entertainment and grocery_pos show moderate fraudulent amounts (\250$ - $500), suggesting attempts to blend large purchases within normal consumer behavior

In contrast, essential service categories (e.g., gas_transport, health_fitness, personal_care) exhibit low-value frauds, consistent with their lower resale potential.

This pattern aligns with economic rationality in fraud behavior - targeting sectors with the highest financial gain and lowest detection barriers.

Temporal Analysis: Amount over Time¶

The goal here is to understand whether transaction amounts and particularly high-value fraudulent amounts, show temporal patterns.

By examining the evolution of transaction values over months, weekdays, and hours, we can determine when high-risk behaviors are most likely to occur

Let's start by visualizing the average daily transaction amount (both overall and for frauds only):

In [49]:
avg_amt_daily = df_train.groupby('date')['amt'].mean()
avg_amt_daily_fraud = df_train[df_train['is_fraud'] == 1].groupby('date')['amt'].mean()

# plot
plt.figure(figsize=(12,6))
plt.plot(avg_amt_daily, label="Average Amount (All)", color='steelblue', linewidth=1.3)
plt.plot(avg_amt_daily_fraud, label="Average Amount (Fraud)", color='crimson', linewidth=1.3)
plt.title("Average Transaction Amount Over Time")
plt.xlabel("Date")
plt.ylabel("Average Amount ($)")
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
No description has been provided for this image

Graph 14 - Average Transaction Amount Over Time

The graph clearly distinguishes between legitimate and fraudulent spending behaviors

  • legitimate transactions show stable and consistent spending habits, averaging around $60 - $80 per transaction with minimal daily variation

  • Fraudulent transactions, however, are highly erratic - their average value oscillates dramatically between $200 and $1000, spiking at irregular intervals.

This pattern suggests that fraudulent activity occurs in intermittent high-value bursts, likely corresponding to coordinated attack periods or isolated high-gain attempts. The persistent vertical gap between the two lines further confirms that fraudulent transactions consistently involve much larger sums, even though they represent a minor fraction of total volume.

In essence, while normal spending is predictable and stable, fraud behavior is sporadic, opportunistic, and high-impact, which is a defining characteristic of real-world financial crime

The analysis of amt demonstrates that transaction value is both clean and behaviorally rich. It consistently differentiates legitimate and fraudulent patterns, with fraud showing sporadic, high-value bursts and a clear preference for certain high-gain sectors and off-hour timings. These findings confirm that amt is a core predictive driver in fraud detection, capturing both economic magnitude and behavioral intent.

Having established the financial characteristics of fraud, we now turn to demographic indicators - starting with the gender feature, to explore whether transaction behaviors and fraud likelihood vary across customer profiles

♀♂ gender¶

The gender feature introduces an interesting behavioral and ethical dimension. On the one hand, it could reveal differences in spending patterns, risk exposure, or fraud targeting strategies between males and females, potentially useful for model interpretability.

On the other hand, incorporating gender directly into predictive models raises ethical and fairness concerns: bias amplification could cause certain groups to be unfairly flagged as high-risk.

Thus, the goal here is exploratory understanding, not predictive discrimination:

Data Integrity¶
In [50]:
df_train['gender'].value_counts(dropna=False)
Out[50]:
count
gender
F 709863
M 586812

  • The training set contains only two valid gender categories: M (male) and F (female)

  • There are no missing or invalid entries, conirming full data completeness

In addition, we can see that the training set is not perfectly gender-balanced. There is a noticeably larger number of female records compared to male ones.

This imbalance is important to acknowledge: when comparing raw fraud counts, one gender might appear to have more fraud cases simply because it has more total transactions. Therefore, we will normalize by population to compare fraud rates, not absolute counts.

Overall, the gender feature is structurally clean, consistent, and ready for analysis

Fraud Rate by Gender¶
In [51]:
fraud_by_gender = df_train.groupby('gender')['is_fraud'].mean() * 100

plt.figure(figsize=(6,4))
sns.barplot(x=fraud_by_gender.index, y=fraud_by_gender.values, palette='Reds')
plt.title("Fraud Rate by Gender")
plt.xlabel("Gender")
plt.ylabel("Fraud Rate")
plt.show()
/tmp/ipython-input-2461314860.py:4: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x=fraud_by_gender.index, y=fraud_by_gender.values, palette='Reds')
No description has been provided for this image

Graph 15 - Fraud Rate by Gender

After normalization, we find that although there are more female records overall, the fraud rate is higher among males. This means that proportionally, men are involved in fraudulent activity more often per transaction than women in this training set.

This could reflect several factors, for example:

  • behavioral tendencies (e.g., higher-risk spending or greater exposure to certain categories)

  • demographic differences in transaction volume

  • or simply simulation parameters in the data generator

While the difference is statistically noticeable, it should be interpreted cautiously: correlation does not imply causation, and using gender directly as a predictive feature could bias the model unfairly.

To conclude, this analysis shows that gender correlates modestly with fraud occurrence: men exhibit a slightly higher fraud rate, even though women dominate transaction volume. However, given the potential for ethical bias, gender should be treated as an interpretable variable, not a decisive predictor.

Next, we turn to a more geographical perspective, analyzing the impact of location-related features on fraud probability

🌎 Geographical Impact on Fraud¶

The geographical dimension of financial behavior often plays a critical role in understanding and predicting fraudulent activity. Fraud is not only a matter of who commits it and how, but also where it occurs. Geographic features can reveal:

  • Hotspots of fraudulent activity

  • Behavioral irregularities, such as purchases far from the cardholder's home region

  • Socioeconomic influences, since wealthier or denser urban areas often exhibit distinct spending and fraud patterns

  • Transaction network dynamics, reflected in the physical or digital distance between customers and merchants

To investigate these spatial aspects, we will combine and analyze several location-related features in our dataset:

Feature Description
city, state, zip , street Administrative and regional identifiers for cardholder location
lat, long Geographic coordinates of the cardholder
city_pop Estimated population of the cardholder’s city
merch_lat, merch_long Geographic coordinates of the merchant involved in the transaction

Together, these variables allow us to explore fraud behavior across multiple spatial layers - from broad national trends to fine-grained local patterns

Data Integrity and Validation of Geographic Features¶

Before analyzing spatial fraud behavior, we will ensure that the geographical data we are working with is accurate, consistent, and realistic. Fraud analysis depends heavily on location-based reasoning, if our coordinates or city-state information are unreliable, any derived patterns would lose meaning.

To address this, we begin by validating all geographic features present in our dataset:

Feature Description Validation Objective
city, state , street , zip Administrative identifiers Verify that all listed locations correspond to real or valid U.S. places
lat, long Cardholder’s coordinates Confirm they fall within the valid U.S. latitude-longitude range
merch_lat, merch_long Merchant’s coordinates Verify geographic plausibility - merchants should also be within U.S. boundaries
city_pop City population estimate Check for realistic, non-negative population sizes

To conduct this validation, we use the U.S. Cities Database publicly available on GitHub by Kelvins.

This comprehensive dataset includes over 19,000 verified U.S. cities and provides:

  • City and state names

  • Geographic coordinates (latitude and longitude)

By cross-referencing our dataset with this authoritative U.S. source, we can:

  1. Verify that our cities and states exist and are properly matched

  2. Confirm that our coordinates (lat, long, merch_lat, merch_long) fall within realistic U.S. boundaries

  3. Detect any synthetic anomalies or out-of-range coordinates, which might indicate data generation artifacts

In [52]:
url = "https://raw.githubusercontent.com/kelvins/US-Cities-Database/main/csv/us_cities.csv"
us_dataset = pd.read_csv(url)

# Preview
print(us_dataset.head())
print(us_dataset.columns)
   ID STATE_CODE STATE_NAME      CITY          COUNTY  LATITUDE  LONGITUDE
0   1         AK     Alaska      Adak  Aleutians West     56.00    -161.21
1   2         AK     Alaska  Akiachak          Bethel     60.89    -161.39
2   3         AK     Alaska     Akiak          Bethel     60.89    -161.20
3   4         AK     Alaska    Akutan  Aleutians East     54.14    -165.79
4   5         AK     Alaska  Alakanuk        Kusilvak     62.75    -164.60
Index(['ID', 'STATE_CODE', 'STATE_NAME', 'CITY', 'COUNTY', 'LATITUDE',
       'LONGITUDE'],
      dtype='object')
In [53]:
# Count unique values in each geographic feature
geo_features = ['city', 'street', 'state', 'zip', 'lat', 'long', 'merch_lat', 'merch_long', 'city_pop']
unique_counts = {col: df_train[col].nunique() for col in geo_features}

print("Unique values in each geographic feature:\n")
for feature, count in unique_counts.items():
    print(f"{feature:<12}: {count:,}")
Unique values in each geographic feature:

city        : 894
street      : 983
state       : 51
zip         : 970
lat         : 968
long        : 969
merch_lat   : 1,247,805
merch_long  : 1,275,745
city_pop    : 879

The table below summarizes the number of unique values across all geographic features, helping us to assess diversity, realism and internal consistency in the dataset:

Feature Unique Values Interpretation
city 894 Matches a plausible number of medium-to-large U.S. cities represented in the simulation. Indicates broad geographic diversity without redundancy.
street 983 A realistic variety of simulated street names - confirming address-level granularity. This aligns with expectations for synthetic but human-like data generated through name-based simulation (e.g., “Maple St”, “Main Ave”).
state 51 Perfectly consistent — includes all 50 U.S. states plus Washington D.C.
zip 970 Reasonable for a dataset of this scale, ZIP codes are highly granular, and ~1,000 unique values suggest broad spatial coverage without redundancy.
lat, long 968 / 969 Indicates that each cardholder or cardholder location corresponds to a specific coordinate pair - a near one-to-one relationship. The small difference (969 vs. 968) likely reflects rounding or minimal coordinate overlap.
merch_lat, merch_long 1,247,805 / 1,275,745 Extremely high diversity, nearly matching the total number of transactions — implying that each merchant transaction has a unique coordinate pair. This is consistent with how synthetic merchant IDs were generated in the dataset.
city_pop 879 Close to matching the number of cities (894), confirming internal consistency - most cities have distinct population values. Minor overlaps may occur for small towns with shared population estimates.

So far, we can see that all of the geographic features demonstrate logical diversity and consistency, now let's see how clean their values are, by cross-referencing with the U.S. database:

In [54]:
# Normalize state codes
valid_states = set(us_dataset['STATE_CODE'].unique())
state_match_ratio = df_train['state'].isin(valid_states).mean()

print(f"{state_match_ratio:.2%} of state codes in the dataset match valid U.S. states")

# Show any invalid or unknown state codes
invalid_states = df_train.loc[~df_train['state'].isin(valid_states), 'state'].unique()
print("Invalid or unrecognized states:", invalid_states)
100.00% of state codes in the dataset match valid U.S. states
Invalid or unrecognized states: []

As we can see, 100% of the entries in the state feature match the valid U.S. states, confirming consistent and clean data.

Next, we validate lat, long, merch_lat and merch_long. The continental U.S. lies approximately within the following ranges:

Dimension Minimum Maximum
Latitude 24.0° N 49.0° N
Longitude –125.0° W –66.0° W

We will use these boundaries to check whether all geographic coordinates fall within realistic U.S. limits:

In [55]:
# Define valid U.S. geographic boundaries
lat_min, lat_max = 24.0, 49.0
lon_min, lon_max = -125.0, -66.0

# Check cardholder coordinates
invalid_lat = df_train[~df_train['lat'].between(lat_min, lat_max)]
invalid_lon = df_train[~df_train['long'].between(lon_min, lon_max)]

# Check merchant coordinates
invalid_merch_lat = df_train[~df_train['merch_lat'].between(lat_min, lat_max)]
invalid_merch_lon = df_train[~df_train['merch_long'].between(lon_min, lon_max)]

# Results
print(f"Invalid cardholder latitudes: {len(invalid_lat)}")
print(f"Invalid cardholder longitudes: {len(invalid_lon)}")
print(f"Invalid merchant latitudes: {len(invalid_merch_lat)}")
print(f"Invalid merchant longitudes: {len(invalid_merch_lon)}")
Invalid cardholder latitudes: 4679
Invalid cardholder longitudes: 4679
Invalid merchant latitudes: 11062
Invalid merchant longitudes: 5227

The coordinate validation shows that while the vast majority of both cardholder and merchant locations fall within valid U.S. boundaries, a small minority of points (≈ 0.4% for cardholders and ≈ 0.8% for merchants) lies slightly outside the continental latitude - longitude range. This deviation is expected and acceptable, as it likely represents U.S. territories (such as Hawaii, Alaska, or Puerto Rico) or minor coordinate noise introduced during data simulation. Therefore, these outliers do not compromise the overall geographic integrity of the dataset, the coordinate features remain realistic, coherent and suitable for analysis

Let's analyze the city feature:

In [56]:
# Normalize both city columns for consistent comparison
df_train['city_norm'] = df_train['city'].str.title().str.strip()
us_dataset['city_norm'] = us_dataset['CITY'].str.title().str.strip()

# Create a set of valid cities for fast lookup
valid_cities = set(us_dataset['city_norm'])

# Check validity
match_ratio = df_train['city_norm'].isin(valid_cities).mean()
print(f"{match_ratio:.2%} of cities in the dataset match valid U.S. cities")

# Mismatches
invalid_cities = df_train.loc[~df_train['city_norm'].isin(valid_cities), 'city'].unique()[:20]
print("Sample of non-matching cities:", invalid_cities)
99.58% of cities in the dataset match valid U.S. cities
Sample of non-matching cities: ['New York City' 'Pembroke Township']
In [57]:
# Check for near matches / alternative naming
us_dataset[us_dataset['CITY'].str.contains("New York", case=False)]
us_dataset[us_dataset['CITY'].str.contains("Pembroke", case=False)]
Out[57]:
ID STATE_CODE STATE_NAME CITY COUNTY LATITUDE LONGITUDE city_norm
4057 4058 FL Florida Pembroke Pines Broward 26.02 -80.30 Pembroke Pines
4638 4639 GA Georgia Pembroke Bryan 32.16 -81.55 Pembroke
9351 9352 KY Kentucky Pembroke Christian 36.80 -87.33 Pembroke
10379 10380 MA Massachusetts North Pembroke Plymouth 42.09 -70.79 North Pembroke
10406 10407 MA Massachusetts Pembroke Plymouth 42.06 -70.80 Pembroke
11329 11330 ME Maine Pembroke Washington 44.97 -67.20 Pembroke
15445 15446 NC North Carolina Pembroke Robeson 34.69 -79.18 Pembroke
18313 18314 NY New York East Pembroke Genesee 43.00 -78.31 East Pembroke
27148 27149 VA Virginia Pembroke Giles 37.33 -80.62 Pembroke

Only two cities failed to match exactly: New York City and Pembroke Township. A closer inspection shows that:

  • New York City corresponds to New York in the U.S. Cities Database, which is the same geographic entity differing only by the "City" suffix

  • Pembroke Township aligns with multiple valid Pembroke locations across states such as Florida, Georgia, and North Carolina - all legitimate U.S municipalities.

These discrepancies stem purely from synthetic naming variations introduced by the simulator, not from invalid or missing data. Therefore:

  • The city feature is structurally complete, geographically accurate, and free from semantic inconsistencies

  • No cleaning or data correction is required

Let's validate street values. The street values are simulated, so we can't verify them against a real-world database. However, we can still check data structure integrity, ensuring they look like real street names and are diverse:

In [58]:
# Checking diversity
print(df_train['street'].sample(10).tolist())
['72269 Elizabeth Field Apt. 132', '7529 Carter Well Suite 262', '41851 Victor Drives Suite 219', '220 Frank Gardens', '597 Jenny Ford Apt. 543', '37910 Ward Lights', '663 Anna Plaza', '144 Martinez Curve', '6970 Blake Trail', '950 Sheryl Spurs']

The street names are looking realistic, and there are clearly no missing or clearly invalid entries. This is expected and we can therefore continue checking the zip feature next. We will check that ZIP codes follow valid U.S. formatting - numeric and in range (00500-99950):

In [59]:
invalid_zips = df_train[(df_train['zip'] < 501) | (df_train['zip'] > 99950)]
print(f"Invalid ZIP codes: {len(invalid_zips)}")
Invalid ZIP codes: 0

As we can see, all ZIPs are valid and within range - confirming geographical plausibility.

Finally, we verify that population values are positive, realistic, and demographically plausible

In [60]:
print(df_train['city_pop'].describe())
count   1,296,675.00
mean       88,824.44
std       301,956.36
min            23.00
25%           743.00
50%         2,456.00
75%        20,328.00
max     2,906,700.00
Name: city_pop, dtype: float64
In [61]:
non_integer_pop = df_train[~(df_train['city_pop'] % 1 == 0)]
print(f"Number of non-integer population entries: {len(non_integer_pop)}")
Number of non-integer population entries: 0

The city_pop feature shows a strongly right-skewed distribution, which perfectly aligns with the real demographic structure of the United States:

  • Most records come from smaller towns or suburban areas - reflected in the low median (≈2500 residents)

  • The upper quartile (≈20,000) represents medium-sized cities

  • The extreme tail (max ≈2.9M) corresponds to major metropolitan areas like New York, Los Angeles, or Chicago.

The mean (≈88K) being much higher than the median confirms the long-tail nature of U.S. urban populations - many small towns and few very large cities.

There are no negative or unrealistic values, and all population magnitudes are demographically plausible (ranging from small rural communities to dense urban centers). This confirms that the city_pop feature is clean.

Geographical Fraud Analysis¶

Having verified the integrity and realism of our geographical data, we now move from validation to spatial exploration, using geographic and demographic attributes to uncover patterns that explain where and why fraudulent activity occurs.

We focus on three main research questions:

  1. Distance-Fraud Relationship:

    Do transactions that occur farther from the cardholder's location have a higher likelihood of being fraudulent?

  2. Population-Fraud Correlation:

    Is fraud more prevalent in densely populated cities, or do smaller towns experience disproportionately higher fraud rates?

  3. State-Level Fraud Analysis:

    Which states contribute the most to total fraud, and which exhibit the highest fraud rates relative to their transaction volumes?

Map Visualization:

In [62]:
# Prepare city-level fraud stats
city_stats = df_train.groupby('city').agg(
    total_txn =('is_fraud', 'count'),
    total_fraud = ('is_fraud', 'sum'),
    avg_population=('city_pop', 'mean')
)
city_stats['fraud_rate'] = (city_stats['total_fraud'] / city_stats['total_txn'])

# Normalize city names in both datasets
city_stats = city_stats.reset_index()
city_stats['city_norm'] = city_stats['city'].str.title().str.strip()
us_dataset['city_norm'] = us_dataset['CITY'].str.title().str.strip()

# Merge with coordinates
city_map_data = pd.merge(
    city_stats,
    us_dataset[['city_norm', 'LATITUDE', 'LONGITUDE', 'STATE_CODE', ]],
    on= 'city_norm',
    how='inner'
)
In [63]:
# Initialize map centered on continental US
fraud_map = folium.Map(location=[37.5, -96.5], zoom_start=4, tiles='CartoDB positron')

# Create cluster for better performance
marker_cluster = MarkerCluster().add_to(fraud_map)

# Add markers
for _, row in city_map_data.iterrows():
  if row['total_txn'] < 30: # skip small cities (low data reliability)
    continue
  popup_text = (f"<b>City:</b> {row['city_norm']}<br>"
                f"<b>State:</b> {row['STATE_CODE']}<br>"
                f"<b>Population:</b> {int(row['avg_population']):,}<br>"
                f"<b>Fraud Rate:</b> {row['fraud_rate']:.2f}%<br>"
                f"<b>Fraud Cases:</b> {int(row['total_fraud'])}<br>"
                f"<b>Total Transactions:</b> {int(row['total_txn'])}")
  # Color code - higher fraud rate = darker red
  color = 'green' if row['fraud_rate'] < 0.5 else 'orange' if row['fraud_rate'] < 2 else 'red'

  folium.CircleMarker(
      location=[row['LATITUDE'], row['LONGITUDE']],
      radius=max(3, min(row['fraud_rate'] / 2, 10)), # scale size with fraud rate
      color=color,
      fill=True,
      fill_opacity=0.6,
      popup=popup_text
  ).add_to(marker_cluster)

fraud_map
Out[63]:
Make this Notebook Trusted to load map: File -> Trust Notebook

Graph 16 - Interactive Map of Fraud Rate

The map above visualizes the spatial distribution of fraud cases across US. Each marker represents a city, colored-coded by its fraud intensity and scaled by the relative severity of fraudulent activity

  • 🟢 Green markers represent low-risk cities (fraud rate < 0.5%)

  • 🟠 Orange markers indicate moderate-risk regions (0.5-2%)

  • 🔴 Red markers highlight fraud hotspots (>2%)

Here are the key observations:

  1. Widespread moderate activity:

    The majority of US cities display yellow-orange markers, indicating moderate levels of fraud between 0.5-2%. This suggests that fraud is not isolated to specific regions, but rather distributed across the entire country, consistent with the idea that digital and card-based fraud is a nationwide phenomenon.

  2. Concentration in the South and East:

    Noticeably higher fraud rates are observed in the southern and eastern United States Cities, including dense urban corridors and economically active states. These regions host a higher concentration of large metropolitan areas, which naturally experience greater transaction volume, and thus higher exposure to fraud attempts.

  3. Peripheral regions with low risk:

    Outlying cities in areas such as Alaska, Hawaii and Puerto Rico (San Juan) show mainly green points, reflecting very low fraud rates. This pattern likely results from lower population

  4. Population does not directly predict fraud:

    Despite including population data in the visualization, there is no clear linear correlation between city size and fraud rate. Some highly populated cities (e.g., large metro areas) exhibit moderate fraud rates, while certain smaller towns demonstrate disproportionately high rates. This highlights that fraud exposure is influenced not only by population but also by factors such as economic activity, transaction diversity, and local enforcement intensity.

Overall, the interactive map illustrates that fraudulent activity in the US is both geographically diverse and spatially correlated - areas with dense commerce and urban concentration tend to attract more fraud attempts, yet smaller, less populated regions are not immune.

Distance-Fraud Relationship:

In [64]:
def haversine(lat1, lon1, lat2, lon2):
    # Convert degrees to radians
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1

    a = np.sin(dlat/2)**2 + np.cos(lat1)*np.cos(lat2)*np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    r = 6371  # Earth radius in km
    return c * r

# Apply to dataframes
df_train['distance_cardholder_merchant'] = haversine(
    df_train['lat'], df_train['long'], df_train['merch_lat'], df_train['merch_long']
)

df_test['distance_cardholder_merchant'] = haversine(
    df_test['lat'], df_test['long'], df_test['merch_lat'], df_test['merch_long']
)
In [65]:
# Bin distances into ranges (for clarity)
bins = [0, 1, 5, 10, 50, 100, 500, 1000, 5000, np.inf]
labels = ["<1km", "1–5km", "5–10km", "10–50km", "50–100km", "100–500km", "500–1000km", "1000–5000km", ">5000km"]

df_train['distance_group'] = pd.cut(df_train['distance_cardholder_merchant'], bins=bins, labels=labels)

# Compute fraud rate per distance group
distance_fraud_stats = (
    df_train.groupby('distance_group', observed=False)['is_fraud']
    .agg(['count', 'sum'])
    .rename(columns={'count': 'Total Txns', 'sum': 'Fraud Txns'})
)
distance_fraud_stats['Fraud Rate (%)'] = 100 * distance_fraud_stats['Fraud Txns'] / distance_fraud_stats['Total Txns']

display(distance_fraud_stats)
Total Txns Fraud Txns Fraud Rate (%)
distance_group
<1km 106 1 0.94
1–5km 2598 9 0.35
5–10km 7862 43 0.55
10–50km 254630 1430 0.56
50–100km 728628 4276 0.59
100–500km 302851 1747 0.58
500–1000km 0 0 NaN
1000–5000km 0 0 NaN
>5000km 0 0 NaN

We used the Haversine distance function to calculate the shortest distance (in kms) between two points on Earth's surface using their latitude and longitude coordinates.

Using the distances between the cardholder's coordinates and the merchant's location, transactions were grouped into distance intervals to analyze how physical proximity influences fraud probability.

Based on the findings:

  1. Majority of transactions are local or regional, since nearly all transactions fall within 500km. There are no transactions beyond that range.

  2. There is a slightly elevated fraud rate for very short distances, for instance, transactions occurring within less than 1 km from the cardholder's registered location show a fraud rate of 0.94%, slightly higher than the average across other bins. This might have a connection with the usage of online services or network purchases

Based on those findings, we can see that distance might not necessarily be the key predictor in this dataset, however, it can still be useful for finding patterns when combined with other features.

Zip feature usefulness

The zip feature includes around 970 unique values, indicating realistic granularity across the dataset. This confirms that the ZIP code field is syntactically valid and exhibits sufficient diversity to represent geographically distributed transactions:

In [66]:
zip_city_corr = df_train.groupby(['city', 'zip']).size().reset_index(name='count')
print(f"Number of (city, zip) pairs: {len(zip_city_corr)} vs total cities: {df_train['city'].nunique()}")
Number of (city, zip) pairs: 970 vs total cities: 894

However, further exploration revealed that ZIP codes are highly correlated with city - with 969 unique (city,ZIP) pairs across 894 cities, yielding an almost one-to-one relationship. This means ZIP codes, while structurally valid, do not contribute additional geographic insight beyond the city field. Therefore, the zip feature will be excluded from subsequent analysis to reduce redundancy

In [67]:
df_train = df_train.drop(columns=['zip'])
df_test = df_test.drop(columns=['zip'])

print("ZIP column dropped from both train and test datasets")
ZIP column dropped from both train and test datasets

Our comprehensive geographic exploration demonstrates that the dataset's spatial features are accurate, realistic, and analytically reliable, offering a solid foundation for spatial fraud modeling.

  • All geographic attributes were validated against verified U.S. data, and the dataset showed near-perfect consistency, with only minimal outliers likely representing U.S. territories or synthetic noise

  • Fraudulent activity is broadly distributed across the United States rather than localized to specific regions. The interactive map highlights moderate fraud intensity (0.5 - 2%) in most areas, with denser fraud presence in the South and East, where economic activity and transaction volumes are higher

  • The average transaction distance (~76km) is virtually identical for both legitimate and fraudulent transactions, implying that distance is not a discriminative feature in this dataset, but remains valuable for cross-temporal analysis

  • Population size alone does not predict fraud - both small and large cities experience similar fraud rates. This suggests that fraud exposure depends more on economic behavior and transaction diversity than on city size.

  • A preliminary assessment shows that ZIP codes likely provide redundant geographic information, strongly correlated with city. They can be safely omitted from further modeling or visualization to simplify the analysis

In conclusion, the geographic features are clean, interpretable, and diverse, supporting the reliability of subsequent modeling tasks while confirming that fraudulent activity is geographically widespread rather than isolated

🔨 job¶

The job feature represents the cardholder's occupation. Occupational data can reveal socioeconomic and behavioral patterns that correlate with both transaction behavior and fraud vulnerability.

From a behavioral perspective:

  • Certain professions (for instance, executives, engineers, or salespeople) may show higher transaction volume due to lifestyle or travel

  • Jobs with frequent online spending or travel might face greater fraud exposure

  • Conversely, other professions might exhibit lower fraud risk, potentially due to fewer high-value purchases or less card usage

However, analyzing this feature requires caution, we must first make sure the data is clean, meaningful and ethically interpreted, since job-based profiling could introduce bias if misused:

Data Integrity¶
In [68]:
print(f"Number of unique jobs in the training set: {df_train['job'].nunique()}")
print(f"Number of unique jobs in the test set: {df_test['job'].nunique()}")

print("\nSample job from the training set:")
print(df_train['job'].sample(10).tolist())
Number of unique jobs in the training set: 494
Number of unique jobs in the test set: 478

Sample job from the training set:
['Loss adjuster, chartered', 'Waste management officer', 'Education administrator', 'Chief Executive Officer', 'Sports development officer', 'Firefighter', 'Secondary school teacher', 'Heritage manager', 'Web designer', 'Systems developer']

The dataset contains ≈ 500 unique job titles, covering a wide range of occupations, from Neurosurgeon to Tax adviser. There are no missing entries, confirming that the feature is structurally complete. However, the large number of categories introduces sparsity: most job titles appear only a handful of times. This sparsity may limit direct interpretability and requires aggregation or encoding (e.g., target encoding, frequency grouping) for machine learning

Job Frequency and Fraud Rate¶

Let's explore which occupations appear most frequently and whether certain jobs exhibit higher than average fraud rates

In [69]:
job_freq = df_train['job'].value_counts().head(15)
print(job_freq)
job
Film/video editor                      9779
Exhibition designer                    9199
Naval architect                        8684
Surveyor, land/geomatics               8680
Materials engineer                     8270
Designer, ceramics/pottery             8225
Systems developer                      7700
IT trainer                             7679
Financial adviser                      7659
Environmental consultant               7547
Chartered public finance accountant    7210
Scientist, audiological                7174
Chief Executive Officer                7172
Copywriter, advertising                7146
Comptroller                            6730
Name: count, dtype: int64

The most common occupations in the training set include technical, creative, and financial professions, such as Film/Video Editor, Exhibition Designer, Naval Architect, and Financial Adviser. This reflects a diverse yet synthetic occupational landscape, where roles were randomly assigned to ensure variety rather than mirroring real-world job frequency.

The dominance of certain creative and technical titles also suggests that the dataset is balanced by design rather than by socioeconomic distribution, meaning job frequencies do not represent actual labor-market proportions, they are primarily useful for behavioral segmentation and categorical encoding during modeling

In [70]:
# Filter jobs with sufficient data
filtered_jobs = (
    df_train.groupby('job')['is_fraud']
    .agg(['count', 'mean'])
    .rename(columns={'count': 'total_txn', 'mean': 'fraud_rate'})
    .query('total_txn >= 100') # filter out rare occupations for sufficient representation
    .sort_values(by='fraud_rate', ascending=False)
)
filtered_jobs['fraud_rate'] *= 100

print(f"Number of job categories after filtering: {filtered_jobs.shape[0]}")
filtered_jobs.head(15)
Number of job categories after filtering: 475
Out[70]:
total_txn fraud_rate
job
Lawyer 540 5.19
TEFL teacher 533 4.13
Community development worker 536 4.10
Clinical cytogeneticist 508 3.54
Writer 504 2.98
Geneticist, molecular 545 2.94
Conservator, museum/gallery 514 2.92
Magazine journalist 533 2.63
Field trials officer 518 2.51
Civil Service administrator 506 2.37
Medical technical officer 1066 2.35
Charity officer 519 2.31
Pharmacist, hospital 1059 2.27
Minerals surveyor 530 2.26
Engineer, structural 492 2.24
In [71]:
plt.figure(figsize=(10,6))
sns.barplot(
    y=filtered_jobs.head(15).index,
    x=filtered_jobs.head(15)['fraud_rate'],
    hue=filtered_jobs.head(15).index,
    palette='Reds_r',
    legend=False
)
plt.title('Top 15 Jobs by Fraud Rate (%) - Filtered (≥100 transactions)')
plt.xlabel('Fraud Rate (%)')
plt.ylabel('Job Title')
plt.show()
No description has been provided for this image

Graph 17 - Top 15 jobs by Fraud Rate (Filtered)

From the following observations we can conclude that:

  1. Fraud rates are realistic and stable, ranging roughly between 2-5% among the top jobs. This validates our filtering step as it removes random outliers caused by small-sample bias

  2. The top jobs span multiple sectors - law, education, healthcare, journalism, and science - suggesting no single occupational domain dominates in fraudulent activity. Instead, fraud appears evenly distributed across various professions, which is consistent with synthetic data where fraud is not occupationally biased.

  3. Some profession like lawyers, consultants or writers may reflect higher transaction independence - individuals who manage their own payments, travel, or online activities, possibly increasing exposure to fraud-like transactions. Interestingly, several specialized scientific and medical professions appear among the higher-fraud categories. However, these anomalies are most likely artifacts of the synthetic data generation process, where occupations were randomly assigned and not causally linked to fraud risk, resulting in spurious correlations that do not reflect real-world behavior.

  4. Each listed occupation has around 500 - 1000 transactions, confirming adequate sample size and reliability. This means small fluctuations might occur due to random variance but still indicate coherent model behavior.

Overall, the job feature demonstrates strong data completeness and semantic richness, offering potential insights into behavioral differences among cardholders. However, due to its high cardinality and synthetic assignment, raw job titles are not directly predictive of fraud risk. The feature remains valuable for modeling when transformed, for example, through frequency encoding, target encoding, or sector based grouping - which will help capture broad socioeconomic patterns without introducing overfitting.

Having validated and explored occupational patterns, we can now turn to another demographic variable - the dob (date of birth) feature - to investigate whether age-related behavioral trends influence fraud likelihood

👶 dob¶

The dob feature captures the date of birth of the cardholder, representing a fundamental demographic attribute. Age-related information can be an important factor in fraud detection, as spending habits, digital literacy and risk exposure can vary significantly across age groups.

From an analytical perspective:

  • Younger users might exhibit more online or mobile-driven spending, possibly increasing exposure to digital fraud

  • Middle-aged users often perform higher-value transactions, making them more attractive targets for fraudsters

  • Older users may show more stable spending patterns, which can make deviations more detectable.

Before drawing any conclusions, we first verify that the dob feature is well-structured and realistic:

Data Integrity¶

The first step in validating the dob feature is to confirm that all date values are correctly formatted and fall within plausible human age boundaries. To do so, we inspect the minimum and maximum birthdates and visualize their chronological distribution

In [72]:
df_train['dob'] = pd.to_datetime(df_train['dob'], errors='coerce')
df_test['dob'] = pd.to_datetime(df_test['dob'], errors='coerce')
print(f"Minimum DOB: {df_train['dob'].min()}")
print(f"Maximum DOB: {df_train['dob'].max()}")
Minimum DOB: 1924-10-30 00:00:00
Maximum DOB: 2005-01-29 00:00:00

Legal and logical context:

Credit card eligibility varies by country, but typically in the U.S., the minimum age is 18 to open a credit account in one's own name. Minors can only have a card as authorized users on a parent's account. So in a U.S. - based dataset like this one, individuals born after 2001 (under 18 in 2019) would be highly suspicious or implausible as cardholders - they should either be authorized users, not primary cardholders, or represent synthetic noise from the data generator

Let's first extract all transactions made by cardholders under 18 years old, since this violates typical credit-card eligibility rules (In the U.S). We will examine their count and fraud ratio

In [73]:
# Compute age column
df_train['age'] = (df_train['year'] - df_train['dob'].dt.year)
df_test['age'] = (df_test['year'] - df_test['dob'].dt.year) # 'age' will be used as a feature in the training process, therefore we added it to test set

print(f"Minimum age: {df_train['age'].min()}")
print(f"Maximum age: {df_train['age'].max()}")
print(df_train[['dob', 'date', 'age']].head())
Minimum age: 14
Maximum age: 96
         dob        date  age
0 1988-03-09  2019-01-01   31
1 1978-06-21  2019-01-01   41
2 1962-01-19  2019-01-01   57
3 1967-01-12  2019-01-01   52
4 1986-03-28  2019-01-01   33

Now that we have age column, we can drop dob column from the training and test sets:

In [74]:
df_train = df_train.drop(columns=['dob'], errors='ignore')
df_test = df_test.drop(columns=['dob'], errors='ignore')

print("dob column dropped from both train and test datasets")
dob column dropped from both train and test datasets
In [75]:
illegal_age_mask = df_train['age'] < 18
illegal_age_txn = df_train[illegal_age_mask]

print(f"Number of transactions with cardholders under 18: {len(illegal_age_txn)}")
Number of transactions with cardholders under 18: 13430
In [76]:
# Fraud ratio
fraud_ratio_illegal = illegal_age_txn['is_fraud'].mean() * 100
print(f"Fraud rate among underage transactions: {fraud_ratio_illegal:.2f}%")
Fraud rate among underage transactions: 0.45%

We've found 13,430 transactions that are linked to cardholders under 18 years old. Given that our training set has around 1.29 million rows, that's roughly 1% of all transactions. This confirms that underage entries exist but represent a small fraction of the dataset. Moreover, the fraud rate among these under-18 records is 0.45%, which only suggests that the presence of underaged cardholders likely does not encode a special behavioral signal related to fraud. Instead, it is more likely consistent with random simulation variance from the data generator

Therefore, given that the dataset is synthetically generated, these entries are retained in the dataset, as they do not distort statistical distributions or introduce bias. Their inclusion helps preserve the dataset's overall structure and ensures that subsequent models are trained on the full synthetic variety of profiles

Let us now check the age distribution among cardholders:

In [77]:
plt.figure(figsize=(10,5))
sns.histplot(df_train['age'], bins=40, kde=True, color='mediumseagreen')
plt.title('Distribution of Ages')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
No description has been provided for this image

Graph 18 - Distribution of Ages

The age distribution is realistic, continuous and clean, with no formatting errors or implausible outliers. Most cardholders are young to middle-aged adults (25-55 year old), which aligns with the profile of typical active credit card users. A small fraction of younger entries (under 18) and older entries (above 85) are also present. Both represent natural statistical tails of the synthetic population rather than real anomalies. These groups are rare and it reflects the real-world scenario, where elderly and younger generations are more often expected to use cash rather than credit cards

Overall, the dob feature is clean and reliable, and the derived age variable can be confidently used in analysis and modelling part

Average transaction amount per age¶
In [78]:
age_stats = df_train.groupby('age').agg(
    avg_amt=('amt', 'mean'),
    fraud_rate=('is_fraud', 'mean'),
    transaction_count=('is_fraud', 'count')
).reset_index()

age_stats['fraud_rate'] *= 100 # Convert fraud rate to percentage
In [79]:
plt.figure(figsize=(10,5))
sns.lineplot(data=age_stats, x='age', y='avg_amt', color='steelblue')
plt.title('Average Transaction Amount by Age')
plt.xlabel('Age')
plt.ylabel('Average Amount')
plt.grid(True, linestyle ='--', alpha=0.5)
plt.show()
No description has been provided for this image

Graph 19 - Average Transaction Amount by Age

The visualization above shows the average transaction amount for each cardholder age

  1. Early adulthood (15-25 years old):

    • Spending is lower and unstable, with noticeable fluctuations
    • This group likely includes students or early-career individuals, making smaller, inconsistent purchases
    • The few very young ages (below 18) are rare outliers and may explain the short dip near age 20
  2. Prime working years (30-45 years old):

    • The highest average spending occurs in this range, peaking around the mid-30s to early 40s (~$80-83 per transaction)

    • This reflects typical increased financial capacity and more frequent high-value transactions, consistent with income growth and family-related expenses

  3. Middle to older adults (50-70 years old):

    • The average amount gradually declines, stabilizing around $65-70

    • This suggests reduced purchase frequency or more controlled spending habits as individuals age

  4. Seniors (70+ years old):

    • Spending remains moderate but erratic, likely due to smaller sample sizes and synthetic noise in later ages.

    • The absence of a clear upward or downward trend beyond 80 supports that the data is synthetic but statistically stable

In [80]:
plt.figure(figsize=(10,5))
sns.lineplot(data=age_stats, x='age', y='fraud_rate', color='firebrick')
plt.title('Fraud Rate by Age (%)')
plt.xlabel('Age')
plt.ylabel('Fraud Rate (%)')
plt.grid(True, linestyle ='--', alpha=0.5)
plt.show()
No description has been provided for this image

Graph 20 - Fraud Rate by Age (%)

Several distinct patterns emerge from the following graph:

  1. Overall fraud rates remain low

    • Across nearly all ages, the fraud rate fluctuates between 0.3% and 1%, with a few random spikes. This confirms that the dataset maintains a realistic fraud prevalence consistent with typical financial data
  2. Younger users (under 25)

    • Fraud levels are noisier and occassionally spike (around age 18-20), likely due to small sample size and low transaction volume among minors and new credit users

    • These brief peaks are not meaningful behavioral signals

  3. Prime working age (25 - 60)

    • Fraud rate is relatively stable, hovering around 0.5-0.8%

    • This suggests a balanced risk distribution , no single adult age range is disproportionately targeted

    • Consistent fraud levels here likely reflect well-distributed spending and exposure across this demographic

  4. Older adults (70+)

  • There is a slight upward drift and more volatility starting around age 70, with peaks exceeding 1.5 - 1.8%

  • This may correspond to lower digital literacy, less frequent account monitoring or targeted fraud attempts - a pattern that, in real-world data, often reflects heightened vulnerability among elderly populations.

  • However, since this dataset is synthetically generated, these fluctuations might also arise from random noise rather than genuine behavioral effects

  • To verify this, we will statistically test whether the higher fraud rate observed in older age groups represents a consistent pattern or simply a random variance caused by limited sample size

Statistical test:

We will conduct a two-proportion Z-test to determine whether the increase in fraud prevalence among elderly cardholders is merely random variance in the dataset, or real pattern in the dataset. We will compare the fraud rate of elderly users (≥70 years) against that of all younger users (<70 years)

In [81]:
# Define elderly threshold (70 years and above)
elderly_mask = df_train['age'] >= 70

# Fraud counts
fraud_elderly = df_train.loc[elderly_mask, 'is_fraud'].sum()
fraud_non_elderly = df_train.loc[~elderly_mask, 'is_fraud'].sum()

# Sample sizes
n_elderly = elderly_mask.sum()
n_non_elderly = (~elderly_mask).sum()

# Run two-proportion z-test
count = np.array([fraud_elderly, fraud_non_elderly])
nobs = np.array([n_elderly, n_non_elderly])
z_stat, p_value = proportions_ztest(count, nobs, alternative='larger')

print(f"Z-statistic: {z_stat:.3f}")
print(f"P-value: {p_value:.5f}")

# Calculate fraud rates for reference
fraud_rate_elderly = (fraud_elderly / n_elderly) * 100
fraud_rate_non_elderly = (fraud_non_elderly / n_non_elderly) * 100

print(f"Fraud rate (Elderly 70+): {fraud_rate_elderly:.3f}%")
print(f"Fraud rate (Non-Elderly <70): {fraud_rate_non_elderly:.3f}%")
Z-statistic: 13.324
P-value: 0.00000
Fraud rate (Elderly 70+): 0.832%
Fraud rate (Non-Elderly <70): 0.548%

The Z-test returned a z = 13.32 and p < 0.000001, confirming that the elderly group exhibits a statistically higher fraud rate (0.83%) compared to younger users (0.55%). This indicates that the apparent increase is not random noise, but a systematic pattern within the dataset, potentially reflecting realistic demographic vulnerability or an intentional behavior embedded in the data generator

The derived age variable shows excellent structural quality and meaningful behavioral variation. Age influences both spending patterns and fraud exposure, with the middle-aged group (30-45) showing the highest spending intensity and elderly users (70+) exhibiting a statistically significant increase in fraud rate. While the dataset is synthetic, the relationship between age and fraud aligns with plausible real-world dynamics, suggesting that the data generator encoded age-dependent spending and risk behavior realistically. Therefore, age can be confidently retained as a valuable predictive feature, both for behavioral segmentation and for improving the model's ability to detect fraud across demographic groups

😵 is_fraud¶

The is_fraud feature is the target label indicating whether a transaction is fraudulent (1) or legitimate (0)

In [82]:
fraud_counts = df_train['is_fraud'].value_counts()

plt.figure(figsize=(4, 4))
plt.pie(fraud_counts, labels=['Not Fraud (0)', 'Fraud (1)'], autopct='%1.2f%%', startangle=90, colors=['skyblue', 'lightcoral'])
plt.title("Distribution of Fraudulent vs. Non-Fraudulent Transactions")
plt.show()
No description has been provided for this image

Graph 21 - Distribution of Fraudulent vs. Non-Fraudulent Transactions

The dataset shows a strong class imbalance: nearly all transactions are non-fraudulent, while only a tiny fraction (≈0.5%) represent actual fraud

This mirrors real-world financial datasets, where fraudulent transactions are rare but high-impact events

Such imbalance poses a serious modeling challenge:

  • Models trained on raw data may default to predicting non-fraud to achieve high accuracy but low recall

  • Consequently, they may fail to detect rare frauds, which are the most critical to identify

To address this, during the modeling phase, we will consider:

  • Resampling techniques such as SMOTE (Synthetic Minority Oversampling Technique)

  • Cost-sensitive learning, by assigning higher class weights to fraud cases

  • Evaluation metrics beyond accuracy, such as Precision, Recall, F1-Score and ROC-AUC, to ensure fair assessment under imbalance

Dropping helper features

During the EDA process, we created helper columns to build graphs and visualizations, as well as to conduct certain statistical tests for data integrity. These columns do not contribute to the training process and are redundant. However, we will still use some features for the sake of feature engineering in the later sections, therefore, we will keep a raw copy of the original dataset before dropping the columns:

In [83]:
df_train.columns
Out[83]:
Index(['cc_num', 'merchant', 'category', 'amt', 'gender', 'street', 'city',
       'state', 'lat', 'long', 'city_pop', 'job', 'merch_lat', 'merch_long',
       'is_fraud', 'hour', 'day_of_week', 'month', 'year', 'date', 'suffix',
       'city_norm', 'distance_cardholder_merchant', 'distance_group', 'age'],
      dtype='object')
In [84]:
drop_cols = [
    'suffix',
    'lat',
    'long',
    'merch_lat',
    'merch_long',
    'city_norm',
    'street', # Was used for the purpose of EDA, but is redundant for training
    'year', # used for the purpose of EDA, but has only 2 values which are not sufficient enough for training
    'distance_group'
]
In [85]:
df_train.drop(columns=[col for col in drop_cols if col in df_train.columns], inplace=True )
df_test.drop(columns=[col for col in drop_cols if col in df_test.columns], inplace=True )

# Confirm the structure
print("Dropped helper columns. Remaining features:")
print(df_train.columns)
Dropped helper columns. Remaining features:
Index(['cc_num', 'merchant', 'category', 'amt', 'gender', 'city', 'state',
       'city_pop', 'job', 'is_fraud', 'hour', 'day_of_week', 'month', 'date',
       'distance_cardholder_merchant', 'age'],
      dtype='object')
In [86]:
# Copies for Feature Engineering
df_train_raw = df_train.copy()
df_test_raw = df_test.copy()

print("created df_train_raw and df_test_raw")

# For unsupervised learning, remove 'date' and any columns unsuitable for distance-based algorithms
df_train = df_train.drop(columns=['date'])
df_test = df_test.drop(columns=['date'])
created df_train_raw and df_test_raw

Unsupervised Learning: PCA, t-SNE, and Clustering¶

Beyond predictive modeling, unsupervised learning allows us to explore the dataset's intrinsic structure without relying on the target label.

By projecting high-dimensional transactions into lower dimensions, we can visualize how naturally the data forms clusters, and whether those clusters correspond to fraudulent behavior.

We focus on two complementary techniques:

Method Purpose Characteristics
PCA Linear dimensionality reduction Captures global variance; fast and interpretable
t-SNE Non-linear manifold learning Preserves local neighborhoods; excellent for revealing small clusters

We will also apply K-Means clustering to the transformed data to detect hidden groupings and assess their correspondence with the known fraud labels.

Encoding Strategy for Unsupervised Learning

Unsupervised methods are sensitive to how we turn raw columns into numbers. Distances and variances come from the encodings, so we want compact, leakage-free representations that don't explode dimensionality.

Encodings types that will be safe for our usage:

  1. One-Hot-Encoding: will be used for low-cardinality categoricals (like gender and category) because of its interpretability

  2. Frequency Encoding - mapping each category to its relative frequency in the dataset (for features like merchant, job, city, state and cc_num). The pro here is that it is stable, and it preserves global distribution signal

  3. Cyclical encodings for time - hour, day_of_week, month will turn into sin/cos pair transformations (which respect periodicity and work great with PCA/K-Means)

For supervised learning, we will introduce fraud rate encoding for the implementation of supervised models, but we will not use it here

Frequency Encoding:

In [ ]:
class FrequencyEncoder(BaseEstimator, TransformerMixin):
  """
  Encodes categorical features by their frequency (normalized counts),
  replacing the original categorical columns
  """
  def __init__(self, min_freq=0, normalize=True):
    self.min_freq = min_freq
    self.normalize = normalize
    self.freq_maps_ = {}

  def fit(self, X, y=None):
    X = pd.DataFrame(X).copy()
    for col in X.columns:
      counts = X[col].value_counts(normalize=self.normalize)
      if self.min_freq > 0:
        threshold = self.min_freq if self.normalize else int(self.min_freq)
        counts = counts[counts >= threshold]
      self.freq_maps_[col] = counts.to_dict()
    return self

  def transform(self, X):
    X = pd.DataFrame(X).copy()
    for col in X.columns:
      mapping = self.freq_maps_.get(col, {})
      X[col] = X[col].map(mapping).fillna(0)
    return X.values

Cyclical Encoding:

In [97]:
class CyclicalTimeEncoder(BaseEstimator, TransformerMixin):
  """
  Encodes periodic time features (like hour, day_of_week, month)
  into sine and cosine components to preserve their cyclical nature.
  Automatically handles both numeric and string inputs.
  """

  def __init__(self, period_map=None):
    self.period_map = period_map or {}
    self.feature_names_out_ = []

    # Defined mappings for text-based time features
    self.day_map = {
        'Monday': 0, 'Tuesday': 1, 'Wednesday': 2, 'Thursday': 3,
        'Friday': 4, 'Saturday': 5, 'Sunday': 6
    }

    self.month_map = {
        'January': 1, 'February': 2, 'March': 3, 'April': 4, 'May': 5,
        'June': 6, 'July': 7, 'August': 8, 'September': 9,
        'October': 10, 'November': 11, 'December': 12
    }

  def fit(self, X, y=None):
    self.columns_ = X.columns.tolist()
    return self

  def transform(self, X):
    X = pd.DataFrame(X).copy()
    result = pd.DataFrame(index=X.index)
    self.feature_names_out_ = []

    for col in X.columns:
      period = self.period_map.get(col, None)
      if period is None:
        raise ValueError(f"No period specified for column '{col}'")

      # Handle string-based days or months automatically
      if X[col].dtype == 'object':
        if col == 'day_of_week':
          X[col] = X[col].map(self.day_map)
        elif col == 'month':
          X[col] = X[col].map(self.month_map)
        X[col] = pd.to_numeric(X[col], errors='coerce')

      # Apply sin and cos transformation
      result[f"{col}_sin"] = np.sin(2 * np.pi * X[col] / period)
      result[f"{col}_cos"] = np.cos(2 * np.pi * X[col] / period)
      self.feature_names_out_.extend([f"{col}_sin", f"{col}_cos"])

    return result.values

  def get_feature_names_out(self, input_features=None):
    return np.array(self.feature_names_out_, dtype=object)

Small explanation:

  • We define the cyclical columns and their periods (e.g., 24 hours in a day, 7 days in a week, 12 months in a year)

  • The transformer computes sin(2πx / period) and cos(2πx / period) for each column

  • It outputs all of them as new numeric features, useful for PCA, clustering, etc.

Data Preparation:

In [ ]:
# Separate features (X) and target label (y)
X = df_train.drop(columns=['is_fraud'])
y = df_train['is_fraud'].astype(int)

# Define Column Groups
card_col = ['cc_num']
high_card_cols = ["merchant", "job", "city", "state"]
low_card_cols = ["gender", "category"]
time_cols = ["hour", "day_of_week", "month"]

exclude_cols = ["is_fraud"] + high_card_cols + low_card_cols + time_cols
num_cols = (
    df_train.select_dtypes(include=["int64", "float64", "int32", "float32"])
    .columns.difference(exclude_cols)
    .tolist()
)
print("Numeric Columns:", num_cols)
Numeric Columns: ['age', 'amt', 'cc_num', 'city_pop', 'distance_cardholder_merchant']
In [ ]:
# preprocessing transformer
preprocess_unsupervised = ColumnTransformer(
    transformers=[
        # Frequency encoding for high-cardinality categorical features
        ("freq_high", FrequencyEncoder(), high_card_cols),

        # Frequency encoding for card number (activity-based encoding)
        ("freq_card", FrequencyEncoder(), card_col),

        # One-hot encoding for low-cardinality categorical features
        ("onehot_low", OneHotEncoder(handle_unknown="ignore", sparse_output=False), low_card_cols),

        # Cyclical time encoding (hour, day_of_week, month)
        ("cyclical_time", CyclicalTimeEncoder(period_map={
            'hour': 24,
            'day_of_week': 7,
            'month': 12
        }), time_cols),

        # Min-Max scaling for continuous numeric features
        ("scaler", MinMaxScaler(), num_cols),
    ],
    remainder="drop",
    verbose_feature_names_out=False,
)

Applying The Transformer:

In [ ]:
X_prepared = preprocess_unsupervised.fit_transform(X)
print("Final transformed shape:", X_prepared.shape)
Final transformed shape: (1296675, 32)
〽 PCA¶

Principal Component Analysis (PCA) is applied to uncover the dominant directions of variance in the dataset and to evaluate how many components are sufficient to represent most of its information.

By projecting the data into orthogonal axes that capture maximal variance, we can identify the intrinsic dimensionality of the transaction space and prepare for lower-dimensional visualization or clustering:

In [ ]:
# Initialize PCA
pca = PCA(n_components=None, random_state=42)
X_pca_full = pca.fit_transform(X_prepared)

# Calculate explained variance ratio
explained_variance_ratio = pca.explained_variance_ratio_
cumulative_variance = np.cumsum(explained_variance_ratio)

# Plot
plt.figure(figsize=(8,5))
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, marker='o')
plt.title("Explained Variance by Principal Components")
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Explained Variance")
plt.grid(True)
plt.tight_layout()
plt.show()

# Show first few component contributions
for i, var in enumerate(explained_variance_ratio[:5], 1):
  print(f"Component {i}: {var:.4f} variance explained")
print(f"\nTotal variance explained by first 2 components: {cumulative_variance[1]:.2%}")
No description has been provided for this image
Component 1: 0.1193 variance explained
Component 2: 0.1171 variance explained
Component 3: 0.1121 variance explained
Component 4: 0.1111 variance explained
Component 5: 0.1081 variance explained

Total variance explained by first 2 components: 23.65%

Graph 22 - Explained Variance by Principal Components

The first two components explain approximately 23.6% of the total variance, meaning that a 2D projecting captures about one-quarter of the overall data structure, sufficient for visual inspection but not full reconstruction.

The cumulative variance rises sharply across the first few components, reaching about 55%-60% by the 5th component, and flattens around the 7-10 components, where most of the meaningful variance has already been captured. Beyond roughly 20 components, the gain in explained variance becomes negligible, indicating that the majority of variability in the encoded dataset can be effectively represented in a 10-20 dimensional subspace instead of the full 32 dimensions

In [ ]:
# Reduce to 2D
pca_2d = PCA(n_components=2, random_state=42)
X_pca_2d = pca_2d.fit_transform(X_prepared)

# plot
plt.figure(figsize=(8,6))
plt.scatter(
    X_pca_2d[:, 0],
    X_pca_2d[:, 1],
    c=y, cmap='coolwarm', s=2, alpha=0.6
)
plt.title("PCA Projection (2 Components) - Fraud vs. Non Fraud")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.colorbar(label='is_fraud (0 = Non-Fraud, 1 = Fraud)')
plt.tight_layout()
plt.show()
No description has been provided for this image

Graph 23 - PCA projection

The 2D PCA projection displays transactions according to their principal component coordinates, where each point represents a single transaction

  • Blue points correspond to legitimate transactions (is_fraud = 0)
  • Red points represent fraudulent transactions (is_fraud = 1)

Each axis (Principal Component 1 and 2) captures the directions of highest variance in the dataset after encoding and scaling, essentially, the two most informative linear combinations of all numerical, categorical and cyclical features.

Based on the visualization above, we can conclude the following:

  1. The overall pattern forms distinct horizontal bands or stripes, which arise due to dominant structured variables such as:

    • Cyclical time features (hour, day_of_week, month) that repeat periodically

    • Categorical encodings (category, state, job) that introduce discrete variance steps

    This structured appearance means that data is highly organized and not random. Transactions exhibit repetitive behavioral patterns (e.g., daily purchasing cycles, consistent merchant categories, or recurring spending behavior).

  2. The fraudulent transactions (red points) are sparse and dispersed throughout the legitimate clusters. They do not form any clear or isolated cluster, instead, they blend into the dense blue regions. This indicates that:

  • Fraudulent activity does not create a distinct high-variance direction that PCA can easily separate

  • Fraud behavior is embedded within legitimate transaction space, mimicking normal user patterns

This supports many of the observations we have witnessed in the feature exploration, where fraudulent transaction were intentionally designed to look legitimate.

We know that the first two components capture only 23.6% of the total variance in the dataset. While this is enough to provide a broad visualization of transaction patterns, it does not represent the full complexity of the data. Therefore, this 2D projection should be seen as a compressed illustration, not as a complete separation of behavioral dynamics.

💡 Note:

The fact that fraudulent and non-fraudulent transactions overlap heavily in this projection, suggests that fraud cannot be linearly separated in the feature space. This highlights the need for **non-linear techniques*** such as t-SNE

➿ t-SNE¶

While PCA captures global, linear variance, it may miss subtle local relationships hidden within the high-dimensional feature space. To uncover these non-linear patterns, we apply t-SNE (t-Distributed Stochastic Neighbor Embedding) a non-linear manifold learning technique designed to preserve local neighborhoods, meaning, points that are close in high-dimensional space remain close in the 2D embedding.

This would make t-SNE particularly effective for visualizing the structure, subtle clusters, outliers, and features that are often missed by PCA, especially when fraudulent transactions represent small, context-specific anomalies hidden within legitimate activity

In [ ]:
# Because t-SNE is computationally expensive, we take a sample
"""
NOTE: Sampling does not distort overall structure because the dataset is large
and well-distributed. 10,000 points are sufficient to approximate global behavior
while keeping runtime manageable
"""

sample_size = 10000 # can be adjusted if we need finer detail
X_sample = X_prepared[:sample_size]
y_sample = y[:sample_size]

# Initialize t-SNE
start_time = time.time()
tsne = TSNE(
    n_components=2,
    perplexity=70, # balances local/global structure
    learning_rate=300, # moderate, prevents local "worming"
    max_iter=1500, # allows convergence
    init='pca', # smoother start, perserves global layout
    random_state=42,
    verbose=1
)
X_tsne = tsne.fit_transform(X_sample)
print(f"t-SNE completed in {time.time() - start_time:.2f} seconds")
[t-SNE] Computing 211 nearest neighbors...
[t-SNE] Indexed 10000 samples in 0.001s...
[t-SNE] Computed neighbors for 10000 samples in 2.259s...
[t-SNE] Computed conditional probabilities for sample 1000 / 10000
[t-SNE] Computed conditional probabilities for sample 2000 / 10000
[t-SNE] Computed conditional probabilities for sample 3000 / 10000
[t-SNE] Computed conditional probabilities for sample 4000 / 10000
[t-SNE] Computed conditional probabilities for sample 5000 / 10000
[t-SNE] Computed conditional probabilities for sample 6000 / 10000
[t-SNE] Computed conditional probabilities for sample 7000 / 10000
[t-SNE] Computed conditional probabilities for sample 8000 / 10000
[t-SNE] Computed conditional probabilities for sample 9000 / 10000
[t-SNE] Computed conditional probabilities for sample 10000 / 10000
[t-SNE] Mean sigma: 0.671876
[t-SNE] KL divergence after 250 iterations with early exaggeration: 75.412872
[t-SNE] KL divergence after 1500 iterations: 0.725678
t-SNE completed in 182.67 seconds
In [ ]:
# Visualization
plt.figure(figsize=(8,6))
plt.scatter(
    X_tsne[:, 0],
    X_tsne[:, 1],
    c=y_sample,
    cmap='coolwarm',
    s=5,
    alpha=0.6
)
plt.title("t-SNE Projection (2 Components) - Fraud vs. Non-Fraud")
plt.xlabel("t-SNE Dimension 1")
plt.ylabel("t-SNE Dimension 2")
plt.colorbar(label="is_fraud (0 = Non-Fraud, 1 = Fraud)")
plt.tight_layout
plt.show()
No description has been provided for this image

Graph 24 - t-SNE projection

The visualization above shows the t-SNE projection of 10,000 randomly sampled transactions:

  • Blue points remain the legitimate transactions (is_fraud = 0)

  • Red points remain the fraudulent transactions (is_fraud = 1)

Each point corresponds to a transaction embedded in a 2D space that preserves local similarity from the original 32-dimensional feature space.

The resulting map forms a series of compact, rounded clusters, each representing transactions that share similar behavioral or contextual properties. For example, purchases from similar merchant types, time patterns, or geographical regions.

This clustered but continuous stucture indicates that transaction behaviors are highly organized, reflecting consistent real-world patterns such as daily routines or repeated merchant interactions.

Fraudulent transactions are scattered within these clusters, showing no isolated or unique grouping. Instead, they are interspersed among legitimate data points, implying the same we've observed in the PCA analysis - fraudulent behavior closely mimics normal transactional patterns, at least within certain contexts.

Therefore, while t-SNE helps confirm that the dataset exhibits strong natural structure, it also highlights the subtle and embedded nature of fraud, justifying the need for supervised and non-linear models to effectively detect such hidden anomalies.

⭕ K-Means¶

K-Means is a simple yet powerful algorithm for discovering latent groupings within data. While PCA and t-SNE focus on visualization, K-Means explicitly partitions the dataset into k clusters, minimizing within-cluster variance.

We would want to identify potential behavioral clusters that capture recurring transaction patterns, and examine whether fraudulent transactions concentrate in any cluster or are spread throughout - which helps us understand the nature of fraudulent behavior.

In [ ]:
# We'll work on a sample (Since K-Means scales poorly with millions of points)
sample_size = 20000
X_sample = X_prepared[:sample_size]
y_sample = y[:sample_size]

# determine optimal number of clusters using elbow method
inertias = []
K_range = range(2,21)

for k in K_range:
  kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
  kmeans.fit(X_sample)
  inertias.append(kmeans.inertia_)

plt.figure(figsize=(7,4))
plt.plot(K_range, inertias, marker='o')
plt.title("Elbow Method - Optimal Number of Clusters (K)")
plt.xlabel("Number of Clusters (K)")
plt.ylabel("Inertia (Within-Cluster Sum of Squares)")
plt.grid(True)
plt.tight_layout()
plt.show()
No description has been provided for this image

Graph 25 - Elbow Method (K-Means)

The extended elbow analysis up to K=20 shows a smooth, monotonically decreasing inertia curve, with no abrupt "elbow" point.

However, the rate of improvement flattens notably beyond K ≈ 10-12, suggesting that most of the structural variance in the data is captured by this range. Increasing K beyond 12 yields only marginal gains, indicating diminishing returns and potential over-segmentation.

Therefore, we consider K = 10 as a practical trade-off between cluster compactness and interpretability.

In [ ]:
kmeans_final = KMeans(n_clusters=10, random_state=42, n_init=10)
cluster_labels = kmeans_final.fit_predict(X_sample)

# Evaluate
silhouette_avg = silhouette_score(X_sample, cluster_labels)
print(f"Average Silhouette Score: {silhouette_avg:.3f}")

# Add cluster assignments to the data
df_clusters = pd.DataFrame({
    'cluster': cluster_labels,
    'is_fraud': y_sample
})
Average Silhouette Score: 0.137

The K-Means model achieved an average silhouette score of 0.137, indicating weakly separated clusters with substantial overlap.

This low score suggests that while some latent structure exist in the transaction space, the boundaries between clusters are not well-defined, as expected in financial data where fraudulent behavior is intentionally blended within legitimate activity patterns.

Fraud Distribution per Cluster:

In [ ]:
fraud_ratio = df_clusters.groupby('cluster')['is_fraud'].mean().sort_values(ascending=False)
fraud_counts = df_clusters.groupby('cluster')['is_fraud'].sum()
total_counts = df_clusters['cluster'].value_counts().sort_index()

fraud_summary = pd.DataFrame({
    'Total Transactions': total_counts,
    'Fraudulent Transactions': fraud_counts,
    'Fraud Ratio (%)': (fraud_ratio * 100).round(3)
})

fraud_summary['Fraud Ratio (%)'].plot(kind='bar', figsize=(8,4), color='tomato', alpha=0.7)
plt.title("Fraud Ratio (%) Across K-Means Clusters")
plt.xlabel("Cluster")
plt.ylabel("Fraud Ratio (%)")
plt.grid(True, axis='y', linestyle='--', alpha=0.6)
plt.tight_layout()
plt.show()
No description has been provided for this image

Graph 26 - Fraud ratio across K-Means Clusters

The chart above shows that fraudulent transactions are unevenly distributed across the 10 clusters. Clusters 0 and 1 exhibit notably higher fraud ratios (around 1% - 1.75%), while most others remain below 0.5%.

This pattern suggests that ceratin behavioral groups carry elevated fraud risk, yet no cluster is dominated by fraud, reinforcing what we've seen in PCA and t-SNE sections.

Conclusion¶

Unsupervised exploration revealed no clear separation between fraudulent and legitimate activity. PCA and t-SNE showed that fraud cases are deeply embedded within normal patterns, while K-Means clustering confirmed only weak separability. These findings highlight that fraud detection in this dataset requires supervised, non-linear modeling capable of capturing the subtle and context-dependent signals hidden in legitimate behavior.

EDA Conclusion¶

The dataset is clean, diverse, and behaviorally rich, providing a strong foundation for fraud detection modeling. Although fraud cases are rare, they exhibit distinct temporal, monetary, and categorical patterns, particularly acrosss time-of-day, transaction amount, and merchant category features.

Key predictive drivers include amt, temporal features like hour, day_of_week, month, and merchant context, while demographic and geographic variables add complementary insights into spending diversity and fraud exposure.

Features such as gender and job require cautious use to avoid bias or overfitting due to sparsity, and redundant features have been safely removed.

Overall, the dataset demonstrates excellent structure, and realistic behavioral consistency, we are ready to move on to the next section, where we explore additional features that we can engineer to further improve training of different supervised models

Feature Engineering¶

Feature engineering is the process of creating, transforming, or selecting variables in order to improve the predictive power of machine learning models.

In our analysis, we've already identified which raw fields are useful and how they can be transformed into meaningful signals for fraud detection.

However, we can still apply some feature engineering to create even better features that will enrich the data and make it more useful for the case of the training of the model

Let us first see the linear correlation between the features we have in the dataset. It will allow us to better understand which features are more reliable for our cause and which are redundant

In [87]:
# Keep only numeric columns for correlation
corr = df_train.corr(numeric_only=True)

plt.figure(figsize=(10, 6))
plt.imshow(corr, cmap='coolwarm', interpolation='nearest')
plt.colorbar(label="Correlation")
plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
plt.yticks(range(len(corr.columns)), corr.columns)
plt.title("Correlation Heatmap of Numeric Features")
plt.tight_layout()
plt.show()
No description has been provided for this image

Graph 27 - Correlation Heatmap

The correlation matrix shows that most numeric features in the dataset are weakly correlated with each other, meaning they provide largely independent information, which is good for machine learning models.

Let us now engineer a few more features that will add extra robustness to the dataset:

📓 card_had_prev_fraud¶

This is a new feature that indicates whether the credit card involved in the current transaction has ever been used in a fraudulent transaction before (based on past data).

For each card, transactions are sorted chronologically, and the feature is set to True only if the same card had a confirmed fraud prior to this transaction - never using any future information. This ensures the feature is completely time-safe and free from data leakage.

We include this feature because cards with a known fraud history are much more likely to commit fraud again, making this a strong behavioral signal for the model. It helps the algorithm capture repeat-offender patterns that are not easily reflected by other transactional or demographic attributes.

In [88]:
df_train = df_train_raw.sort_values(['cc_num', 'date']).copy()

df_train['card_had_prev_fraud'] = (
    df_train.groupby('cc_num')['is_fraud']
    .transform(lambda x: x.shift().cummax().fillna(0))
    .astype(bool)
)

fraud_cards_train = set(df_train.loc[df_train['is_fraud'] == 1, 'cc_num'])

df_test['card_had_prev_fraud'] = df_test['cc_num'].isin(fraud_cards_train)

In order to verify that the new feature was created correctly, we will add a sanity check:

For every transaction where card_had_prev_fraud == True, we verify that there is indeed a prior fraudulent transaction for the same card. Similarly, we confirm that no card is marked as having no prior frauds when in fact it does.

This step helps guarantee that the feature was implemented correctly and that no data leakage or logical inconsistencies were introduced during feature creation.

In [89]:
# Ensure chronological order per card
df_train = df_train.sort_values(['cc_num', 'date']).copy()

# Initialize counters
errors_flagged = 0

# Loop through each card and check consistency
for card, group in df_train.groupby('cc_num'):
    # Compute the true "previous fraud" flag from scratch
    true_prev_fraud = group['is_fraud'].shift().cummax().fillna(0).astype(bool)

    # Compare to our feature
    if not (true_prev_fraud == group['card_had_prev_fraud']).all():
        errors_flagged += 1
        print(f"Inconsistency found for card: {card}")
        display(pd.concat([group[['date', 'is_fraud', 'card_had_prev_fraud']],
                           true_prev_fraud.rename('true_prev_fraud')], axis=1).head(10))

if errors_flagged == 0:
    print("All cards consistent: every 'card_had_prev_fraud' flag is correct.")
else:
    print(f"{errors_flagged} cards had mismatched flags — investigate above.")
All cards consistent: every 'card_had_prev_fraud' flag is correct.

We can see that all cards are consistent, which is a great sign. Let us now evaluate how useful the feature really is:

Feature Observation

In [90]:
fraud_rate_by_flag = df_train.groupby('card_had_prev_fraud')['is_fraud'].mean().reset_index()
fraud_rate_by_flag['is_fraud'] *= 100 # Convert to %
fraud_rate_by_flag.rename(columns={'is_fraud' : 'Fraud Rate (%)'}, inplace=True)

plt.figure(figsize=(5,4))
plt.bar(
    fraud_rate_by_flag['card_had_prev_fraud'].astype(str),
    fraud_rate_by_flag['Fraud Rate (%)'],
    color=['#4C72B0', '#C44E52']
)
plt.title("Fraud Rate by Card's Fraud History")
plt.ylabel("Fraud Rate (%)")
plt.xlabel("Card Had Previous Fraud")
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
No description has been provided for this image

Graph 28 - Fraud Rate by Card's Fraud History

The finding is very encouraging. Despite the dataset's extreme class imbalance, the card_had_prev_fraud feature shows a massive behavioral separation: cards with prior fraud history exhibit a fraud rate of ~1.3%, nearly ten times higher than normal cards. This is an extraordinary signal given how rare fraud is overall. In practical terms, the model can instantly identify a subset of transactions where the baseline fraud probability skyrockets, a "high-alert" flag that mimics real-world risk scoring systems used by banks and big corporations

🌡 card_prev_fraud_ratio¶

While card_had_prev_fraud provides a binary signal indicating whether a card has ever been involved in fraud, it does not capture how frequently fraudulent behavior has occurred relative to the card's overall activity.

The feature card_prev_fraud_ratio addresses this limitation by representing the proportion of previous fraudulent transactions out of all past transactions for a given card.

This ratio gives the model a more nuanced, continuous measure of risk, cards with one fraud in 100 transactions behave differently from those with one in three.

In [91]:
df_train = df_train.sort_values(['cc_num', 'date']).copy()

# Count prior frauds and transactions per card
df_train['prev_fraud_count'] = df_train.groupby('cc_num')['is_fraud'].cumsum().shift().fillna(0)
df_train['prev_txn_count'] = df_train.groupby('cc_num').cumcount()

# Ratio of prior frauds
df_train['card_prev_fraud_ratio'] = df_train['prev_fraud_count'] / df_train['prev_txn_count'].replace(0, 1)

# Compute fraud ratio per card from training data only
card_stats = (
    df_train.groupby('cc_num')['is_fraud']
    .agg(['sum', 'count'])
    .rename(columns={'sum': 'train_fraud_count', 'count': 'train_total_count'})
)

card_stats['card_prev_fraud_ratio'] = card_stats['train_fraud_count'] / card_stats['train_total_count']

# Merge into test with the same column name
df_test = df_test.merge(card_stats[['card_prev_fraud_ratio']], on='cc_num', how='left')
df_test['card_prev_fraud_ratio'] = df_test['card_prev_fraud_ratio'].fillna(0)

# Drop 'prev_txn_count' and 'card_prev_fraud_ratio' from train
df_train.drop(columns=['prev_txn_count', 'prev_fraud_count', 'date'], inplace=True)

Feature Observation

In [ ]:
plt.figure(figsize=(8, 5))
sns.boxplot(
    data=df_train,
    x='is_fraud',
    y='card_prev_fraud_ratio',
    hue='is_fraud',
    palette=['#4CAF50', '#E53935'],
    legend=False
)
plt.yscale('log')
plt.title("Distribution of card_prev_fraud_ratio by Fraud Label (Log Scale)", fontsize=14)
plt.xlabel("Fraud Label (0 = Legit, 1 = Fraud)", fontsize=12)
plt.ylabel("Card Previous Fraud Ratio (log scale)", fontsize=12)
plt.grid(True, linestyle="--", alpha=0.4)
plt.show()
No description has been provided for this image

Graph 29 - Distribution of card_prev_fraud_ratio

The boxplot shows that transactions labeled as fraud (1) tend to have noticeably higher previous-fraud ratios than legitimate ones (0), even after applying a logarithmic scale.

This pattern suggests that cards with a history of fraudulent behavior are more likely to be used in new fraudulent transactions. Therefore, card_prev_fraud_ratio is a meaningful and predictive feature that provides strong behavioral signal

⏰ Temporal Flag Features¶

To capture behavioral patterns tied to time, we created a utility function:

In [92]:
def add_time_flags(df):
    if 'hour' not in df.columns or 'day_of_week' not in df.columns:
        raise KeyError("DataFrame must contain 'hour' and 'day_of_week' columns")

    # Convert to numeric
    df['hour'] = pd.to_numeric(df['hour'], errors='coerce').fillna(-1)

    # Handle 'day_of_week' as text or numeric
    if df['day_of_week'].dtype == 'object':
      # Map weekday names to numbers
      day_map = {
          'Monday': 0, 'Tuesday' : 1, 'Wednesday' : 2,
          'Thursday' : 3, 'Friday' : 4, 'Saturday' : 5,
          'Sunday' : 6
      }
      df['day_of_week_num'] = df['day_of_week'].map(day_map)
    else:
      df['day_of_week_num'] = pd.to_numeric(df['day_of_week'], errors='coerce')

    # Create flags
    df['is_night'] = np.where((df['hour'] >= 22) | (df['hour'] < 6), 1, 0)
    df['is_weekend'] = np.where(df['day_of_week_num'] >= 5, 1, 0)
    df.drop(columns=['day_of_week_num'], inplace=True)

    return df
In [93]:
df_train = add_time_flags(df_train)
df_test = add_time_flags(df_test)

This function adds two binary flag features:

  • is_night - since night-time activity may indicate higher fraud risk (as seen in the EDA)

  • is_weekend - weekend spending patterns often differ from weekday activity, which might help the model detect unusual transaction timing.

By incorporating these temporal flags, the model can learn contextual cues about transaction timing, which frequently improves fraud detection performance.

Features Observation

In [ ]:
# Compute fraud rate by each flag
night_fraud_rate = df_train.groupby('is_night')['is_fraud'].mean().reset_index()
weekend_fraud_rate = df_train.groupby('is_weekend')['is_fraud'].mean().reset_index()

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Fraud Rate by Night vs. Day
sns.barplot(
    data=night_fraud_rate,
    x='is_night', y='is_fraud',
    hue='is_night',
    palette={0: '#4CAF50', 1: '#E53935'},
    legend=False,
    ax=axes[0]
)
axes[0].set_title("Fraud Rate by Night vs. Day", fontsize=14)
axes[0].set_xlabel("Is Night", fontsize=12)
axes[0].set_ylabel("Fraud Rate (%)", fontsize=12)
axes[0].set_xticks([0, 1])
axes[0].set_xticklabels(["Day", "Night"])
axes[0].grid(True, linestyle="--", alpha=0.4)

# Annotate fraud rates on top of bars
for i, row in night_fraud_rate.iterrows():
    axes[0].text(
        i,
        row['is_fraud'] + 0.0002,
        f"{row['is_fraud']:.2%}",
        ha='center',
        va='bottom',
        fontsize=10
    )

# Fraud Rate by Weekend vs. Weekday
sns.barplot(
    data=weekend_fraud_rate,
    x='is_weekend', y='is_fraud',
    hue='is_weekend',
    palette={0: '#4CAF50', 1: '#E53935'},
    legend=False,
    ax=axes[1]
)
axes[1].set_title("Fraud Rate by Weekend vs. Weekday", fontsize=14)
axes[1].set_xlabel("Is Weekend", fontsize=12)
axes[1].set_ylabel("Fraud Rate (%)", fontsize=12)
axes[1].set_xticks([0, 1])
axes[1].set_xticklabels(["Weekday", "Weekend"])
axes[1].grid(True, linestyle="--", alpha=0.4)

# Annotate fraud rates on top of bars
for i, row in weekend_fraud_rate.iterrows():
    axes[1].text(
        i,
        row['is_fraud'] + 0.0002,
        f"{row['is_fraud']:.2%}",
        ha='center',
        va='bottom',
        fontsize=10
    )

# Make y-axis show percentages instead of fractions
for ax in axes:
    ax.yaxis.set_major_formatter(mtick.PercentFormatter(1.0))

plt.tight_layout()
plt.show()
No description has been provided for this image

Graph 30 - Fraud Rate Based on New Temporal Features

The temporal analysis shows that is_night is a strong indicator of fraudulent activity, with night transactions being approximately fifteen times more likely to be fraudulent compared to daytime ones.

In contrast, is_weekend shows little distinction between fraud and legitimate transactions, suggesting that the timing of the day is a far stronger fraud signal than the day of the week. However, it might still matter in interaction terms (for instance, weekend and night transactions could have a specific risk pattern)

Let us now observe the correlation matrix again, to see whether the engineered features relate to the target variable/other features in any way:

In [ ]:
# Keep only numeric columns for correlation
corr = df_train.corr(numeric_only=True)

plt.figure(figsize=(10, 6))
plt.imshow(corr, cmap='coolwarm', interpolation='nearest')
plt.colorbar(label="Correlation")
plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
plt.yticks(range(len(corr.columns)), corr.columns)
plt.title("Correlation Heatmap of Numeric Features")
plt.tight_layout()
plt.show()
No description has been provided for this image

Graph 31 - Updated Correlation Heatmap

The updated correlation matrix confirms that the engineered features contribute new, non-redundant information related to fraud detection, while the dataset remains structurally balanced and free from major multicollinearity issues

Conclusion¶

The feature engineering process enriched the dataset with several informative attributes that capture both temporal and behavioral aspects of transaction patterns.

Features such as card_prev_fraud_ratio and card_had_prev_fraud introduce valuable historical context, while is_night adds a meaningful temporal dimension that helps distinguish fraudulent behavior. At the same time, other variables like is_weekend and location-based measures contribute complementary perspectives, even if their direct correlation with fraud is weaker.

Having validated the importance and stability of these features, we are now ready to proceed to the model training and evaluation phase, where these variables will be leveraged to build predictive fraud detection models

Training Models¶

Preprocessing Stage¶

Before training our supervised models, we design a consistent and leakage-free preprocessing strategy that prepares every column appropriately according to its nature and predictive value.

Categorical features will be divided into three groups, each handled differently:

  • Low-cardinality features (category, gender) - just like in unsupervised learning, they will be One-Hot Encoded to preserve interpretability and allow models to learn clear group boundaries

  • High-cardinality features (merchant, job, city, state) will be fraud rate encoded, where each category is replaced with its average fraud rate in the training set. This provides target-aware information (categories more prone to fraud get higher values) while keeping dimensionality low.

  • Unique or identifier-like features (cc_num) will be frequency-encoded, representing how active each card is without leaking fraud labels. This will help models detect unusual card behavior (e.g., a card suddenly used far more often)

Numeric features (continuous variables such as amt, city_pop, age, ...) will be scaled using MinMaxScaler to bring all values into a uniform [0,1] range. This prevents features with larger numeric ranges (like population or amount) from dominating others during model training.

Temporal features will be encoded using sine and cosine transformations (as we have seen in unsupervised learning section). This will ensure that the model understand time as a continuous, circular variable.

Flag features (is_night, is_weekend, card_had_prev_fraud) are already binary indicators (0 or 1) and therefore require no further transformation. They will be passed through the pipeline as they are, because their numeric format is already suitable for model training

In [94]:
class FraudRateEncoder(BaseEstimator, TransformerMixin):
  """
  Encodes categorical features based on their historical fraud rate
  """
  def __init__(self, min_samples: int = 1, smoothing: float = 0.0, dtype: str = "float32"):
      self.min_samples = min_samples
      self.smoothing = smoothing
      self.dtype = dtype
      self.category_stats_ = None
      self.global_rate_ = None
      self.feature_name_out_ = None
      self._in_name = None

  def fit(self, X, y):
      # X is a single-column array/dataframe, y is is_fraud (0/1)
      x_series = X.iloc[:, 0] if isinstance(X, pd.DataFrame) else pd.Series(X.ravel())
      y_series = pd.Series(y).astype(float)

      self._in_name = X.columns[0] if isinstance(X, pd.DataFrame) else 'col'

      # Group stats
      grp = pd.DataFrame({"x": x_series, "y": y_series}).groupby("x")["y"].agg(["mean", "count"])
      self.global_rate_ = y_series.mean()

      if self.smoothing > 0:
          # m-estimate smoothing toward global_rate
          smooth_num = grp["mean"] * grp["count"] + self.smoothing * self.global_rate_
          smooth_den = grp["count"] + self.smoothing
          rate = smooth_num / smooth_den
      else:
          rate = grp["mean"]

      # Enforce min_samples fallback to global
      rate = rate.where(grp["count"] >= self.min_samples, self.global_rate_)

      self.category_stats_ = rate
      self.feature_name_out_ = f"{self._in_name}_fraud_rate"
      return self

  def transform(self, X):
      x_series = X.iloc[:, 0] if isinstance(X, pd.DataFrame) else pd.Series(X.ravel())
      out = x_series.map(self.category_stats_).fillna(self.global_rate_).astype(self.dtype)
      return out.to_numpy().reshape(-1, 1)

  def get_feature_names_out(self, input_features=None):
      return np.array([self.feature_name_out_], dtype=object)
In [95]:
class CardFrequencyEncoder(BaseEstimator, TransformerMixin):
  """
  Encodes card numbers (cc_num) by their transaction frequency in the dataset.
  This represents how active each card is, without leaking fraud information

  Unlike `FraudRateEncoder`, this encoder is label-agnostic and purely structural.
  It's ideal for 'cc_num' that uniquely identify entities
  """
  def __init__(self, new_col_name: str = "cc_freq"):
    self.new_col_name = new_col_name
    self.freq_map_ = None

  def fit(self, X, y=None):
    # Expecting a single column
    X = pd.DataFrame(X).copy()
    col = X.columns[0]
    self.freq_map_ = X[col].value_counts().to_dict()
    return self

  def transform(self, X):
    X = pd.DataFrame(X).copy()
    col = X.columns[0]
    out = X[col].map(self.freq_map_).fillna(0).astype("int64")
    return out.to_numpy().reshape(-1,1)

  def get_feature_names_out(self, input_features=None):
    return np.array([self.new_col_name], dtype=object)
In [98]:
low_card_cols = ["category", "gender"] # One-Hot
fraud_rate_cols = ["merchant", "job", "city", "state"] # Fraud Rate Encoding
card_col = ["cc_num"] # Frequency Encoding
num_cols = ["amt", "city_pop", "distance_cardholder_merchant", "age", "card_prev_fraud_ratio"] # Scaled numeric
time_cols = ["hour", "day_of_week", "month"] # Cyclical encoding

preprocess = ColumnTransformer(
    transformers=[
        # Fraud Rate Encoders for high-cardinality categoricals
        ("merchant_rate", FraudRateEncoder(min_samples=100, smoothing=100), ["merchant"]),
        ("job_rate", FraudRateEncoder(min_samples=100, smoothing=100), ["job"]),
        ("city_rate", FraudRateEncoder(min_samples=100, smoothing=100), ["city"]),
        ("state_rate", FraudRateEncoder(min_samples=100, smoothing=100), ["state"]),

        # Frequency Encoding for card number
        ("card_freq", CardFrequencyEncoder(new_col_name="cc_freq"), card_col),

        # One-Hot Encoding for low-cardinality categoricals
        ("onehot_low", OneHotEncoder(handle_unknown="ignore", sparse_output=False), low_card_cols),

        # Cyclical Encoding for time-based features
        ("cyclical_time", CyclicalTimeEncoder(period_map={
            "hour": 24,
            "day_of_week": 7,
            "month": 12
        }), time_cols),

        # Min-Max Scaling for numeric features
        ("scaler", MinMaxScaler(), num_cols)
      ],
    remainder="drop",
    verbose_feature_names_out=False
)

Next, we will use a seed setup to ensure identical sampling in SMOTE, and the same model weights initialization - so every rerun will give the same metrics:

In [99]:
def set_global_seed(seed: int = 42):
  """
  Sets random seed for reproducibility across Python, Numpy and PyTorch.
  Ensures deterministic behavior for CUDA when available
  """
  random.seed(seed)
  np.random.seed(seed)
  torch.manual_seed(seed)
  if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
In [100]:
set_global_seed(42)

To create comfortable, and neat visualizations for model comparisons, we will create helper functions:

compare_two_models

The function compares the results of 2 given models.

It focuses on recall, precision, F1 and ROC-AUC. The results given by the function will be analyzed and interpreted in the later stages of the section

In [101]:
def compare_two_models(y_true, preds_1, probs_1, preds_2, probs_2,
                       model_names=('Model 1', 'Model 2')):
    """
    Compares two models by given evaluation metrics,
    visualizes results as a bar chart
    """

    # Compute metrics
    metrics = ['Precision', 'Recall', 'F1-score', 'ROC-AUC']
    scores_1 = [
        precision_score(y_true, preds_1, zero_division=0),
        recall_score(y_true, preds_1, zero_division=0),
        f1_score(y_true, preds_1, zero_division=0),
        roc_auc_score(y_true, probs_1)
    ]
    scores_2 = [
        precision_score(y_true, preds_2, zero_division=0),
        recall_score(y_true, preds_2, zero_division=0),
        f1_score(y_true, preds_2, zero_division=0),
        roc_auc_score(y_true, probs_2)
    ]

    # Display table
    results_df = pd.DataFrame({
        'Metric': metrics,
        model_names[0]: np.round(scores_1, 3),
        model_names[1]: np.round(scores_2, 3)
    })
    print(results_df.to_string(index=False))
    print()

    # Bar chart
    x = np.arange(len(metrics))
    width = 0.35
    fig, ax = plt.subplots(figsize=(8, 5))

    bars1 = ax.bar(x - width/2, scores_1, width, label=model_names[0])
    bars2 = ax.bar(x + width/2, scores_2, width, label=model_names[1])

    # Annotate bars
    for bars in [bars1, bars2]:
        for bar in bars:
            ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.015,
                    f'{bar.get_height():.2f}', ha='center', va='bottom', fontsize=9)

    ax.set_ylabel('Score')
    ax.set_ylim(0, 1.05)
    ax.set_xticks(x)
    ax.set_xticklabels(metrics)
    ax.set_title('Model Comparison: Key Metrics')
    ax.legend()
    ax.grid(axis='y', linestyle='--', alpha=0.6)

    plt.tight_layout()
    plt.show()

compare_three_models

The function compares the results of 3 given models (similarly to the previous helper function).

In [132]:
def compare_three_models(y_true,
                         preds_1, probs_1,
                         preds_2, probs_2,
                         preds_3, probs_3,
                         model_names=('Model 1', 'Model 2', 'Model 3')):
    """
    Compares three models by key evaluation metrics
    and visualizes results as a grouped bar chart.
    """

    # Define metrics
    metrics = ['Precision', 'Recall', 'F1-score', 'ROC-AUC']

    # Helper function to compute scores for each model
    def compute_scores(preds, probs):
        return [
            precision_score(y_true, preds, zero_division=0),
            recall_score(y_true, preds, zero_division=0),
            f1_score(y_true, preds, zero_division=0),
            roc_auc_score(y_true, probs)
        ]

    # Compute scores
    scores_1 = compute_scores(preds_1, probs_1)
    scores_2 = compute_scores(preds_2, probs_2)
    scores_3 = compute_scores(preds_3, probs_3)

    # Create DataFrame
    results_df = pd.DataFrame({
        'Metric': metrics,
        model_names[0]: np.round(scores_1, 3),
        model_names[1]: np.round(scores_2, 3),
        model_names[2]: np.round(scores_3, 3)
    })

    # Display table
    print(results_df.to_string(index=False))
    print()

    # Plot bar chart
    x = np.arange(len(metrics))
    width = 0.25
    fig, ax = plt.subplots(figsize=(9, 5))

    bars1 = ax.bar(x - width, scores_1, width, label=model_names[0])
    bars2 = ax.bar(x, scores_2, width, label=model_names[1])
    bars3 = ax.bar(x + width, scores_3, width, label=model_names[2])

    # Annotate bars
    for bars in [bars1, bars2, bars3]:
        for bar in bars:
            ax.text(bar.get_x() + bar.get_width()/2,
                    bar.get_height() + 0.015,
                    f'{bar.get_height():.2f}',
                    ha='center', va='bottom', fontsize=9)

    ax.set_ylabel('Score')
    ax.set_ylim(0, 1.05)
    ax.set_xticks(x)
    ax.set_xticklabels(metrics)
    ax.set_title('Model Comparison: Key Metrics')
    ax.legend()
    ax.grid(axis='y', linestyle='--', alpha=0.6)

    plt.tight_layout()
    plt.show()

plot_model_performance

The function visualizes the performance of trained model by focusing on the same key evaluation metrics as the previous helper function. It provides a clear overview of model behavior through a detailed confusion matrix and ROC curve, allowing intuitive assessment of classification accuracy and discriminative power

In [102]:
def plot_model_performance(y_true, y_pred, y_proba, model_name="Model"):
    """
    Plots the confusion matrix, ROC curve, and metric summary for a classification model.
    """

    # Compute metrics
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    roc_auc = roc_auc_score(y_true, y_proba)

    # Create 3 subplots
    fig, axes = plt.subplots(1, 3, figsize=(18, 5))

    # Confusion matrix
    ConfusionMatrixDisplay.from_predictions(
        y_true,
        y_pred,
        ax=axes[0],
        cmap="Blues",
        colorbar=False
    )
    axes[0].set_title(f"Confusion Matrix ({model_name})", fontsize=12)

    # ROC curve
    RocCurveDisplay.from_predictions(
        y_true,
        y_proba,
        ax=axes[1],
        name=model_name
    )
    axes[1].plot([0, 1], [0, 1], "--", color="gray")
    axes[1].set_title("ROC Curve", fontsize=12)

    # Metrics Bar Chart
    metrics = ["Precision", "Recall", "F1-score", "ROC-AUC"]
    values = [precision, recall, f1, roc_auc]
    axes[2].bar(metrics, values, color=["#4c72b0", "#55a868", "#c44e52", "#8172b3"])
    axes[2].set_ylim(0, 1)
    axes[2].set_ylabel("Score")
    axes[2].set_title("Performance Metrics", fontsize=12)

    # Display exact values on top of bars
    for i, v in enumerate(values):
        axes[2].text(i, v + 0.02, f"{v:.3f}", ha="center", fontsize=10)

    plt.tight_layout()
    plt.show()

With the preprocessing pipeline fully established and all features consistently prepared, we are now ready to proceed to the model training phase

Models¶

In this stage, we focus on developing and evaluating a set of supervised machine learning models for fraud detection.

Here is the overview of the selected models:

Model How It Works Why We Use It
Logistic Regression A linear model that estimates the probability of fraud using a weighted combination of features and a sigmoid activation function. Serves as a baseline for performance comparison; simple, interpretable, and fast to train.
Random Forest An ensemble of decision trees built on random feature subsets, combining their outputs through majority voting. Handles non-linear relationships and feature interactions effectively while being robust to noise and overfitting.
XGBoost A gradient boosting algorithm that builds trees sequentially, each correcting the errors of the previous one. Known for high predictive accuracy, speed, and built-in regularization; ideal for structured data.
Neural Network A multi-layered model of interconnected neurons that learns complex patterns through non-linear transformations. Capable of capturing deep, non-linear relationships between features and generalizing across diverse patterns.
TabNetClassifier A deep learning architecture specifically designed for tabular data, using sequential attention to focus on the most relevant features at each decision step. Combines the strengths of deep learning with interpretable feature selection, often outperforming traditional models on structured data.

Each model is trained on the preprocessed dataset using the same feature transformations to ensure fairness and consistency across evaluations.

Given the extreme class imbalance in fraudulent transactions, we will apply SMOTE to oversample the minority class and improve the model's ability to recognize rare fraud cases.

Model performance will be assessed primarily using recall and ROC-AUC:

  • Recall is prioritized to minimize false negatives, as missing a fraudulent transaction carries the highest cost

  • ROC-AUC provides a comprehensive measure of each model's discriminative capability across various thresholds.

By comparing these models under identical conditions, we aim to identify the one that achieves the best balance between fraud detection sensitivity and overall predictive performance

🧰 Logistic Regression (Baseline)¶

Training¶

We start by training the baseline model using the original, imbalanced dataset to establish a performance benchmark.

Next, we apply SMOTE to generate synthetic samples for the minority class, thereby balancing the dataset.

Finally, the model is retrained on the resampled data, allowing us to compare results and assess the impact of class balancing on overall predictive performance

In [ ]:
# Version without SMOTE

# Define model
lg = LogisticRegression(random_state=42, max_iter=1000)

X_train = df_train.drop(columns=["is_fraud"])
y_train = df_train["is_fraud"].astype(int)

X_test = df_test.drop(columns=["is_fraud"])
y_test = df_test["is_fraud"].astype(int)


# Build pipeline
steps = [
    ("preprocess", preprocess),
    ("lg", lg)
]

pipe = Pipeline(steps)

pipe.fit(X_train, y_train)

# Predict on test set (no SMOTE here — only real test data)
y_pred_lg = pipe.predict(X_test)
y_proba_lg = pipe.predict_proba(X_test)[:, 1]
In [ ]:
# Evaluate performance
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_lg))
print("\nClassification Report:\n", classification_report(y_test, y_pred_lg))
print("\nROC-AUC Score:", roc_auc_score(y_test, y_proba_lg))
Confusion Matrix:
 [[553574      0]
 [  2145      0]]

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00    553574
           1       0.00      0.00      0.00      2145

    accuracy                           1.00    555719
   macro avg       0.50      0.50      0.50    555719
weighted avg       0.99      1.00      0.99    555719

/usr/local/lib/python3.12/dist-packages/sklearn/metrics/_classification.py:1565: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
/usr/local/lib/python3.12/dist-packages/sklearn/metrics/_classification.py:1565: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
/usr/local/lib/python3.12/dist-packages/sklearn/metrics/_classification.py:1565: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
ROC-AUC Score: 0.8537488669832314

💡 Note on warning:

During evaluation, several UndefinedMetricWarning messages were displayed by scikit-learn. These warnings indicate that the model did not predict any instances of the positive class (fraudulent transactions). In other words, since all predictions were labeled as non-fraud, there were no "positive" predictions to calculate precision or F1-score from, resulting in undefined metrics automatically set to zero.

This behavior is expected and logical in extremely imbalanced datasets, where the model initially learns to favor the dominant class (non-fraud) to minimize overall error. Once class imbalance is addressed, these warnings naturally disappear as the model begins predicting fraud cases correctly

Now we test the same model with SMOTE.

After carefully testing different SMOTE values, we found that increasing the minority class to 20% produces the best results.

In [ ]:
# Version with SMOTE

# Define model
lg = LogisticRegression(random_state=42, max_iter=10000)

# Define SMOTE
smote = SMOTE(
    sampling_strategy=0.2,  # minority class will be 20% of majority
    random_state=42,
    k_neighbors=5
)

# Build pipeline using imblearn.Pipeline
steps = [
    ("preprocess", preprocess),
    ("smote", smote),
    ("lg", lg)
]

pipe = Pipeline(steps)

# Fit the pipeline (SMOTE applied only on training since we used imblearn's pipeline)
pipe.fit(X_train, y_train)

# Predict on test set (no SMOTE here - only real test data)
y_pred_lg_smote = pipe.predict(X_test)
y_proba_lg_smote = pipe.predict_proba(X_test)[:, 1]
Results¶

without SMOTE:

In [ ]:
plot_model_performance(y_test, y_pred_lg, y_proba_lg, model_name="Baseline Logistic Regression")
/usr/local/lib/python3.12/dist-packages/sklearn/metrics/_classification.py:1565: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
No description has been provided for this image

with SMOTE:

In [ ]:
plot_model_performance(y_test, y_pred_lg_smote, y_proba_lg_smote, model_name="Logistic Regression (SMOTE)")
No description has been provided for this image
In [ ]:
compare_two_models(
    y_true=y_test,
    preds_1=y_pred_lg,
    probs_1=y_proba_lg,
    preds_2=y_pred_lg_smote,
    probs_2=y_proba_lg_smote,
    model_names=('Logistic Regression', 'Logistic Regression + SMOTE')
)
   Metric  Logistic Regression  Logistic Regression + SMOTE
Precision                 0.00                         0.19
   Recall                 0.00                         0.62
 F1-score                 0.00                         0.29
  ROC-AUC                 0.85                         0.93

No description has been provided for this image

Model Performance

Before applying SMOTE, the Logistic Regression model completely failed to detect fraudulent transactions.

The precision, recall, and F1-score for the fraud class were all zero, indicating that the model predicted every transaction as legitimate. This outcome highlights the impact of severe class imbalance, the model learned to favor the dominant non-fraud class while entirely ignoring the minority class.

After addressing this imbalance using SMOTE, the model's performance improved. The recall increased to 0.62, meaning the model correctly identified more than half of all fraudulent transactions, while the F1-score rose to 0.29, reflecting a more balanced trade-off between precision and recall.

Furthermore, the ROC-AUC improved from 0.85 to 0.92, confirming that the model's overall discriminative ability between fraudulent and legitimate transactions became stronger.

In summary, balancing the dataset with SMOTE substantially enhanced the model's sensitivity to fraudulent behavior. Although this approach introduced a few false positives, it represents a reasonable and aluable trade-off in fraud detection, where catching more frauds is often prioritized over perfect precision

🌳 Random Forest¶

We begin with a baseline Random Forest model trained on the original, highly imbalanced dataset. At this stage, the model relies on its inherent ability to handle imbalance through bootstrap aggregation and random feature selection, but no external balancing technique is applied.

Afterwards, we will perform hyperparameter optimization using GridSearchCV to identify the most effective combination of parameters (such as tree depth, number of estimators, and class weights) for improving detection performance.

Finally, we compare the optimized model's results to the baseline, focusing on the improvements of the model metrics

Baseline Training¶

Random Forest without SMOTE:

In [ ]:
X_train = df_train.drop(columns=["is_fraud"])
y_train = df_train["is_fraud"].astype(int)

X_test = df_test.drop(columns=["is_fraud"])
y_test = df_test["is_fraud"].astype(int)

rf = RandomForestClassifier(
    n_estimators=200,                 # more trees for stability
    max_depth=None,                   # let trees grow fully
    min_samples_split=5,              # prevent overfitting on minority class
    min_samples_leaf=2,               # same reason
    class_weight="balanced_subsample",# handle imbalance automatically
    random_state=42,
    n_jobs=-1                    # use all CPU cores
)

steps = [("preprocess", preprocess),
         ("rf", rf)]
pipe = Pipeline(steps)

# Fit model
pipe.fit(X_train, y_train)

# Predict
y_pred_rf = pipe.predict(X_test)
y_proba_rf = pipe.predict_proba(X_test)[:, 1]

Random Forest with SMOTE:

In [104]:
X_train = df_train.drop(columns=["is_fraud"])
y_train = df_train["is_fraud"].astype(int)

X_test = df_test.drop(columns=["is_fraud"])
y_test = df_test["is_fraud"].astype(int)

rf = RandomForestClassifier(
    n_estimators=200,                 # more trees for stability
    max_depth=None,                   # let trees grow fully
    min_samples_split=5,              # prevent overfitting on minority class
    min_samples_leaf=2,               # same reason
    class_weight="balanced_subsample",# handle imbalance automatically
    random_state=42,
    n_jobs=-1                    # use all CPU cores
)

smote = SMOTE(
    sampling_strategy=0.2,  # minority class will be 20% of majority
    random_state=42,
    k_neighbors=5
)

steps = [("preprocess", preprocess),
         ("smote", smote),
         ("rf", rf)]
pipe = Pipeline(steps)

# Fit model
pipe.fit(X_train, y_train)

# Predict
y_pred_rf_smote = pipe.predict(X_test)
y_proba_rf_smote = pipe.predict_proba(X_test)[:, 1]

# Evaluate
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_rf_smote))
print("\nClassification Report:\n", classification_report(y_test, y_pred_rf_smote))
print("\nROC-AUC Score:", roc_auc_score(y_test, y_proba_rf_smote))
Confusion Matrix:
 [[553296    278]
 [  1997    148]]

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00    553574
           1       0.35      0.07      0.12      2145

    accuracy                           1.00    555719
   macro avg       0.67      0.53      0.56    555719
weighted avg       0.99      1.00      0.99    555719


ROC-AUC Score: 0.8486340518522304
Baseline Results¶
In [ ]:
plot_model_performance(y_test, y_pred_rf, y_proba_rf, model_name="Random Forest")
No description has been provided for this image
In [105]:
plot_model_performance(y_test, y_pred_rf_smote, y_proba_rf_smote, model_name="Random Forest (SMOTE)")
No description has been provided for this image
In [ ]:
compare_two_models(
    y_true=y_test,
    preds_1=y_pred_rf,
    probs_1=y_proba_rf,
    preds_2=y_pred_rf_smote,
    probs_2=y_proba_rf_smote,
    model_names=('Baseline Random Forest', 'Random Forest + SMOTE')
)
   Metric  Baseline Random Forest  Random Forest + SMOTE
Precision                    0.47                   0.35
   Recall                    0.07                   0.07
 F1-score                    0.12                   0.12
  ROC-AUC                    0.84                   0.85

No description has been provided for this image

Model Performance

from the results seen above, we can see that both models achieve a similar ROC-AUC of approximately 0.84, indicating that their overall ability to distinguish between fraudulent and legitimate transactions remains virtually unchanged. This suggests that applying SMOTE did not significantly affect the model's ranking capability.

However, precision decreased from 0.47 to 0.35 after applying SMOTE, implying that a higher proportion of predicted frauds are now false positives.

Meanwhile, recall remained constant at 0.07, showing that the model still detects only a small fraction of actual fraud cases despite the resampling.

As a result, the F1-score also remained stable at 0.12, reflecting no substantial improvement in the trade-off between precision and recall.

In summary, introducing SMOTE did not enhance fraud detection performance for the Random Forest model in this configuration. Although it slightly reduced precision, it failed to improve recall or overall discriminatory power.

These findings suggest that further optimization - such as hyperparameter tuning may be required to achieve meaningful gains in detecting fraudulent activity

Optimizing Parameters¶

Let us now apply GridSearchCV to tune the Random Forest parameters and select the combination that optimizes recall, aiming to improve fraud detection sensitivity

In [ ]:
X_train = df_train.drop(columns=["is_fraud"])
y_train = df_train["is_fraud"].astype(int)

X_test = df_test.drop(columns=["is_fraud"])
y_test = df_test["is_fraud"].astype(int)

# Parameter grid (kept small to avoid runtime disconnection)
param_grid = {
    "rf__n_estimators": [100],
    "rf__max_depth": [10, 20],          # 2 levels to test underfitting vs overfitting
    "rf__min_samples_split": [5],
    "rf__min_samples_leaf": [2],
    "rf__class_weight": ["balanced_subsample"]  # handle imbalance
}

smote = SMOTE(
    sampling_strategy=0.2,  # minority class will be 20% of majority
    random_state=42,
    k_neighbors=5
)


# Build pipeline
steps = [
    ("preprocess", preprocess),
    ("smote", smote),
    ("rf", RandomForestClassifier(random_state=42))
]
pipe = Pipeline(steps)

# Scoring & CV
recall_scorer = make_scorer(recall_score, pos_label=1)
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)

# GridSearchCV
grid_search = GridSearchCV(
    estimator=pipe,
    param_grid=param_grid,
    scoring=recall_scorer,
    cv=cv,
    n_jobs=-2,
    verbose=1,
    return_train_score=False
)

# Fit grid search
grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)
print(f"Best Cross-Validation Recall: {grid_search.best_score_:.4f}")

# Evaluate best model
best_model = grid_search.best_estimator_
y_pred_rf_best = best_model.predict(X_test)
y_proba_rf_best = best_model.predict_proba(X_test)[:, 1]
Fitting 2 folds for each of 2 candidates, totalling 4 fits
Best Parameters: {'rf__class_weight': 'balanced_subsample', 'rf__max_depth': 10, 'rf__min_samples_leaf': 2, 'rf__min_samples_split': 5, 'rf__n_estimators': 100}
Best Cross-Validation Recall: 0.8213
Optimized Results¶
In [ ]:
compare_two_models(
    y_true=y_test,
    preds_1=y_pred_rf,
    probs_1=y_proba_rf,
    preds_2=y_pred_rf_best,
    probs_2=y_proba_rf_best,
    model_names=('Baseline Random Forest', 'Tuned Random Forest')
)
   Metric  Baseline Random Forest  Tuned Random Forest
Precision                    0.47                 0.01
   Recall                    0.07                 0.09
 F1-score                    0.12                 0.02
  ROC-AUC                    0.84                 0.68

No description has been provided for this image

Model Performance

The comparison between the Baseline Random Forest and the Tuned Random Forest models reveals that hyperparameter optimization did not yield the expected improvements.

Although the tuning process specifically aimed to maximize recall, the results show only a marginal increase, from 0.07 to 0.09, while precision dropped sharply from 0.47 to 0.01, indicating that nearly all predicted fraud cases were false positives.

Furthermore, the ROC-AUC score declined from 0.84 to 0.68, suggesting that the tuned model lost much of its ability to effectively distinguish between fraudulent and legitimate transactions.

The F1-Score also decreased from 0.12 to 0.02, confirming a weaker overall balance between precision and recall.

In summary, despite focusing on improving recall, the tuning process failed to enhance the Random Forest's performance. Both versions of the model struggle to capture the subtle and complex patterns that characterize fraud. These findings suggest that Random Forest may not be the most suitable algorithm for this task, and that more powerful models, such as XGBoost or neural networks, may be necessary to achieve meaningful predictive performance

⚡XGBoost¶

After observing the limitations of Logistic Regression and Random Forest, we now turn to XGBoost, a model well-known for its robustness in handling complex, imbalanced datasets. Its gradient boosting framework allows it to learn subtle nonlinear patterns that simpler models often miss, which is a crucial advantage in detecting rare fraud cases.

Our goal is to assess whether XGBoost can improve recall without severely compromising precision, effectively capturing more fraudulent transactions while maintaining a reasonable false positive rate.

We begin with a baseline configuration to establish reference performance, followed by targeted hyperparamter tuning (adjusting tree depth, learning rate, and class weights) to enhance its sensitivity to fraud detection.

Baseline Training¶

The parameters below are chosen to balance learning stability, model complexity, and recall sensitivity:

  • n_estimators= 900: A relatively high number of trees allows the model to learn gradually and capture subtle fraud patterns, especially when combined with a low learning rate.

  • learning_rate= 0.03: A small learning rate slows down training and prevents overfitting, helping the model generalize better on unseen transactions.

  • max_depth= 6: Medium-depth trees strike a good balance, deep enough to model complex relationships, but not so deep that they memorize noise.

  • min_child_weight= 2: Requires at least a small number of samples in each leaf, which makes the model less likely to overfit to extremely rare or noisy cases.

  • subsample= 0.8 and colsample_bytree= 1.0: Row subsampling (80%) introduces randomness and improves robustness, while using all features per tree helps capture every relevant signal in the relatively small feature space.

  • gamma= 0.1: A light regularization term that prunes splits with minimal gain, keeping the model compact and efficient.

  • reg_lambda= 1 and reg_alpha= 0.1: L2 and L1 regularization terms that prevent overfitting by penalizing overly complex trees while maintaining flexibility to learn important interactions.

  • scale_pos_weight= (majority/minority): Adjusts the loss to give more importance to fraudulent samples.

  • tree_method= "hist": Uses a histogram-based algorithm that's optimized for large datasets, making training much faster without losing accuracy.

  • eval_metric= "aucpr": Precision-Recall AUC is more informative than ROC-AUC for imbalanced datasets, as it focuses on how well the model identifies frauds rather than just overall separation.

XGBoost baseline without SMOTE:

In [ ]:
# Split features and target
X_train = df_train.drop(columns=["is_fraud"])
y_train = df_train["is_fraud"].astype(int)

X_test = df_test.drop(columns=["is_fraud"])
y_test = df_test["is_fraud"].astype(int)

xgb = XGBClassifier(
    n_estimators=900,
    learning_rate=0.03,
    max_depth=6,
    min_child_weight=2,
    subsample=0.8,
    colsample_bytree=1.0,
    gamma=0.1,
    reg_lambda=1,
    reg_alpha=0.1,
    scale_pos_weight=(y_train.value_counts()[0] / y_train.value_counts()[1]),
    n_jobs=-1,
    random_state=42,
    tree_method="hist",
    eval_metric="aucpr"  # better that ROC for imbalanced datasets
)



# Create pipeline
steps = [("preprocess", preprocess),
         ("xgb", xgb)]
pipe = Pipeline(steps)

# Fit model
pipe.fit(X_train, y_train)

# Predict
y_pred_xgb = pipe.predict(X_test)
y_proba_xgb = pipe.predict_proba(X_test)[:, 1]

# Evaluate
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_xgb))
print("\nClassification Report:\n", classification_report(y_test, y_pred_xgb))
print("\nROC-AUC Score:", roc_auc_score(y_test, y_proba_xgb))
Confusion Matrix:
 [[547546   6028]
 [  1695    450]]

Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.99      0.99    553574
           1       0.07      0.21      0.10      2145

    accuracy                           0.99    555719
   macro avg       0.53      0.60      0.55    555719
weighted avg       0.99      0.99      0.99    555719


ROC-AUC Score: 0.8605525873602048

XGBoost baseline model with SMOTE:

In [ ]:
# Split features and target
X_train = df_train.drop(columns=["is_fraud"])
y_train = df_train["is_fraud"].astype(int)

X_test = df_test.drop(columns=["is_fraud"])
y_test = df_test["is_fraud"].astype(int)

xgb = XGBClassifier(
    n_estimators=900,
    learning_rate=0.03,
    max_depth=6,
    min_child_weight=2,
    subsample=0.8,
    colsample_bytree=1.0,
    gamma=0.1,
    reg_lambda=1,
    reg_alpha=0.1,
    scale_pos_weight=(y_train.value_counts()[0] / y_train.value_counts()[1]),
    n_jobs=-1,
    random_state=42,
    tree_method="hist",
    eval_metric="aucpr"  # better that ROC for imbalanced datasets
)



# Create pipeline
steps = [("preprocess", preprocess),
         ("smote", smote),
         ("xgb", xgb)]
pipe = Pipeline(steps)

# Fit model
pipe.fit(X_train, y_train)

# Predict
y_pred_xgb_smote = pipe.predict(X_test)
y_proba_xgb_smote = pipe.predict_proba(X_test)[:, 1]

# Evaluate
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_xgb))
print("\nClassification Report:\n", classification_report(y_test, y_pred_xgb))
print("\nROC-AUC Score:", roc_auc_score(y_test, y_proba_xgb))
Confusion Matrix:
 [[547546   6028]
 [  1695    450]]

Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.99      0.99    553574
           1       0.07      0.21      0.10      2145

    accuracy                           0.99    555719
   macro avg       0.53      0.60      0.55    555719
weighted avg       0.99      0.99      0.99    555719


ROC-AUC Score: 0.8605525873602048
Baseline Results¶
In [ ]:
compare_two_models(
    y_true=y_test,
    preds_1=y_pred_xgb,
    probs_1=y_proba_xgb,
    preds_2=y_pred_xgb_smote,
    probs_2=y_proba_xgb_smote,
    model_names=('XGBoost', 'XGBoost + SMOTE')
)
   Metric  XGBoost  XGBoost + SMOTE
Precision     0.07             0.07
   Recall     0.21             0.27
 F1-score     0.10             0.11
  ROC-AUC     0.86             0.87

No description has been provided for this image

Model Performance

The comparison between the baseline XGBoost model and the XGBoost trained with SMOTE shows only marginal improvements.

Applying SMOTE slightly increased recall from 0.21 to 0.27, indicating a modest gain in the model's ability to identify fraudulent transactions. However, precision remained unchanged at 0.07, meaning that most predicted frauds were still false positives.

The F1-score showed a minimal improvement from 0.1 to 0.11, and ROC-AUC increased slightly from 0.86 to 0.87, suggesting a small gain in overall discrimination but not a meaningful step forward in practical detection performance.

Overall, both XGBoost variants struggle to balance precision and recall effectively. Despite its reputation as high-performing algorithm for structured data, XGBoost underperformed in this setting, failing to capture the rare and complex fraud patterns present in the dataset.

Interestingly, the Logistic Regression model with SMOTE achieved a substantially higher recall, demonstrating that in highly imbalanced problems, simpler models can sometimes outperform more complex ones when paired with appropriate data balancing techniques.

Next Steps

We will try to improve the model's paramets to optimize recall. We will use Gridsearch to test and evaluate several parameters.

Optimizing Parameters¶

Let us now apply GridSearchCV to tune the XGBoost parameters and select the combination that optimizes recall, aiming to improve fraud detection sensitivity

In [ ]:
# Define parameter grid (relatively small grid to avoid disconnection from colab)
param_grid = {
    "xgb__max_depth": [5],
    "xgb__min_child_weight": [2, 5],
    "xgb__colsample_bytree": [0.8, 1.0],
}


# Recall scorer and CV setup
recall_scorer = make_scorer(recall_score, pos_label=1)
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)

# GridSearch
grid_search = GridSearchCV(
    estimator=pipe,
    param_grid=param_grid,
    scoring=recall_scorer,
    cv=cv,
    verbose=2,
    n_jobs=-1
)

grid_search.fit(X_train, y_train)

# Extract best model
best_model = grid_search.best_estimator_
print("\nBest Parameters Found:")
for k, v in grid_search.best_params_.items():
    print(f"  {k}: {v}")

print(f"\nBest Cross-Validated Recall: {grid_search.best_score_:.4f}")

# Evaluate best model on test set
y_pred_xg_best = best_model.predict(X_test)
y_proba_xg_best = best_model.predict_proba(X_test)[:, 1]

# Print evaluation metrics
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred_xg_best))
print("\nClassification Report:\n", classification_report(y_test, y_pred_xg_best))
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba_xg_best):.4f}")
Fitting 2 folds for each of 4 candidates, totalling 8 fits

Best Parameters Found:
  xgb__colsample_bytree: 0.8
  xgb__max_depth: 5
  xgb__min_child_weight: 5

Best Cross-Validated Recall: 0.9484

Confusion Matrix:
 [[540925  12649]
 [  1445    700]]

Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.98      0.99    553574
           1       0.05      0.33      0.09      2145

    accuracy                           0.97    555719
   macro avg       0.52      0.65      0.54    555719
weighted avg       0.99      0.97      0.98    555719

ROC-AUC: 0.8671
Optimized Results¶
In [ ]:
plot_model_performance(y_test, y_pred_xg_best, y_proba_xg_best, model_name="XGBoost (GridSearch Recall-Optimized)")
No description has been provided for this image
In [ ]:
compare_two_models(
    y_true=y_test,
    preds_1=y_pred_xg_best,
    probs_1=y_proba_xg_best,
    preds_2=y_pred_xgb_smote,
    probs_2=y_proba_xgb_smote,
    model_names=('Tuned XGBoost', 'XGBoost')
)
   Metric  Tuned XGBoost  XGBoost
Precision           0.05     0.07
   Recall           0.33     0.27
 F1-score           0.09     0.11
  ROC-AUC           0.87     0.87

No description has been provided for this image

Model Performance

The tuned XGBoost model demonstrates a noticeable improvement in recall, increasing from 0.27 to 0.33, which indicates that it now identifies a larger proportion of fraudulent transactions.

This gain enhances the model's sensitivity to the minority class. The ROC-AUC remained unchanged at 0.87.

However, the recall gains come with trade-offs: precision dropped slightly to 0.05, and the F1-score dropped from 0.11 to 0.09, revealing that the model still produces many false positives and struggles to balance precision and recall effectively.

When compared with the other models evaluated, Logistic Regression with SMOTE continues to provide the best overall balance between accuracy, stability, and interpretability.

Logistic Regression remains the most reliable and consistent performer for this dataset.

🧠 Neural Network¶

After testing traditional machine learning models, we now move toward deep learning to explore whether a neural network can better capture the nonlinear and hidden relationships that define fraudulent behavior.

Unlike tree-based or linear models, neural networks can learn complex, high-dimensional feature interactions directly from data, a capability that may uncover subtle fraud patterns that previous models overlooked.

To integrate this approach seamlessly into our workflow, we implement a custom PyTorch model wrapper (TorchNNWraper) that follows scikit-learn's BaseEstimator and ClassifierMixin interfaces. This design allows us to train, evaluate, and compare the neural network just like any other scikit-learn model, preserving a consistent pipeline structure while leveraging PyTorch's flexibility and computational power.

In essence, this step represents an effort to combine the interpretability and structure of our existing pipeline with the expressive power of deep learning, aiming to push beyond the limitations encountered with traditional algorithms.

Network class¶

Neural Network Architecture and Training Details

Our PyTorch neural network, designed for this fraud detection task, employs a multi-layer perceptron (MLP) architecture to capture complex patterns in the data. The network is structured as follows:

  • Layers: The model consists of three hidden layers with ReLU activation functions:

    • The first two layers have 256 neurons each, followed by Batch Normalization and Dropout (with a rate of 0.3).
    • The third layer has 128 neurons, also followed by Batch Normalization and Dropout.
    • Batch Normalization helps stabilize and accelerate the training process by normalizing the inputs to each layer.
    • Dropout acts as a regularization technique by randomly setting a fraction of neurons to zero during training, which helps prevent overfitting, especially important for handling the imbalance and diversity of the dataset.
    • The final layer is a single output neuron with no activation function, producing a raw logit score for binary classification.
  • Loss Function: We use torch.nn.BCEWithLogitsLoss. This loss function is well-suited for binary classification as it combines a sigmoid layer and the Binary Cross Entropy loss in a single, numerically stable function. Crucially, it allows us to directly apply class weights via the pos_weight parameter to address the severe class imbalance by giving more importance to the minority (fraudulent) class during training.

  • Optimizer: The Adam optimizer is utilized with a learning rate of 1e-3. Adam is an adaptive learning rate optimization algorithm that is widely used for training deep neural networks. Its efficiency and effectiveness in handling large datasets makes it a suitable choice for this problem.

  • Hyperparameter Selection: The architecture (number of layers, neurons per layer) and hyperparameters (like learning rate, batch size, and dropout rate) were determined through empirical experimentation. In a production setting, a more rigorous approach such as cross-validation combined with hyperparameter tuning libraries would be employed to systematically search for the optimal configuration.

Validation and Early Stopping

To prevent overfitting and improve generalization, we separate a validation subset (10%) from the training data during each training run. After every epoch, the model's performance is evaluated on this validation set.

If the validation loss does not improve for a predefined number of epochs (early_stopping_patience=5), training stops automatically.

This ensures the model retains the weights from the epoch with the best validation performance, preventing unnecessary training and reducing the risk of overfitting to the training data.

In [107]:
class TorchNNWrapper(BaseEstimator, ClassifierMixin):
    def __init__(self,
                 input_dim=None,
                 lr=1e-4,
                 batch_size=2048,
                 epochs=100,
                 dropout=0.3,
                 threshold = 0.5,
                 class_weight=None,
                 early_stopping_patience=20,
                 val_split=0.2,
                 device=None,
                 verbose=True):
        self.input_dim = input_dim
        self.lr = lr
        self.batch_size = batch_size
        self.epochs = epochs
        self.dropout = dropout
        self.threshold = threshold
        self.class_weight = class_weight
        self.early_stopping_patience = early_stopping_patience
        self.val_split = val_split
        self.device = device or ("cuda" if torch.cuda.is_available() else "cpu")
        self.verbose = verbose
        self.model_ = None
        self.train_losses_ = []
        self.val_losses_ = []


    # Define NN architecture
    def _build_model(self):
        model = nn.Sequential(
            # First layer
            nn.Linear(self.input_dim, 256),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Dropout(self.dropout),

            # Second layer
            nn.Linear(256, 256),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Dropout(self.dropout),

            # Third layer
            nn.Linear(256, 128),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.Dropout(self.dropout),

            nn.Linear(128, 1)
        )
        return model.to(self.device)

    # Training
    def fit(self, X, y):
        # Convert to tensors
        X = torch.tensor(np.asarray(X), dtype=torch.float32)
        y = torch.tensor(np.asarray(y).reshape(-1, 1), dtype=torch.float32)

        # Split into training and validation
        X_train, X_val, y_train, y_val = train_test_split(
            X, y, test_size=self.val_split, stratify=y.cpu(), random_state=42
        )

        train_loader = DataLoader(TensorDataset(X_train, y_train),
                                  batch_size=self.batch_size, shuffle=True)
        val_loader = DataLoader(TensorDataset(X_val, y_val),
                                batch_size=self.batch_size, shuffle=False)

        # Build model and loss
        self.input_dim = X.shape[1]
        self.model_ = self._build_model()

        if self.class_weight is not None:
            weight_tensor = torch.tensor([self.class_weight[1]], dtype=torch.float32).to(self.device)
            criterion = nn.BCEWithLogitsLoss(pos_weight=weight_tensor)
        else:
            criterion = nn.BCEWithLogitsLoss()

        optimizer = optim.Adam(self.model_.parameters(), lr=self.lr)

        # Early stopping setup
        best_val_loss = float('inf')
        best_state_dict = None
        patience_counter = 0
        self.train_losses_ = []
        self.val_losses_ = []

        # Training loop
        for epoch in range(self.epochs):
            self.model_.train()
            running_loss = 0.0
            for xb, yb in train_loader:
                xb, yb = xb.to(self.device), yb.to(self.device)
                optimizer.zero_grad()
                outputs = self.model_(xb)
                loss = criterion(outputs.view(-1), yb.view(-1))
                loss.backward()
                optimizer.step()
                running_loss += loss.item()
            avg_train_loss = running_loss / len(train_loader)
            self.train_losses_.append(avg_train_loss)

            # Validation
            self.model_.eval()
            val_loss = 0.0
            with torch.no_grad():
                for xb, yb in val_loader:
                    xb, yb = xb.to(self.device), yb.to(self.device)
                    outputs = self.model_(xb)
                    loss = criterion(outputs.view(-1), yb.view(-1))
                    val_loss += loss.item()
            avg_val_loss = val_loss / len(val_loader)
            self.val_losses_.append(avg_val_loss)

            # Print progress
            if self.verbose:
                print(f"Epoch {epoch+1}/{self.epochs} | Train Loss: {avg_train_loss:.4f} | Current Val Loss: {avg_val_loss:.4f} | Best Val Loss: {best_val_loss:.4f}")

            # Early stopping logic
            if avg_val_loss < best_val_loss - 1e-4:
                best_val_loss = avg_val_loss
                best_state_dict = {k: v.cpu().clone() for k, v in self.model_.state_dict().items()}
                patience_counter = 0
            else:
                patience_counter += 1
                if patience_counter >= self.early_stopping_patience:
                    if self.verbose:
                        print(f"Early stopping at epoch {epoch+1} (no improvement for {self.early_stopping_patience} epochs)")
                    break

        # Restore best weights
        if best_state_dict is not None:
            self.model_.load_state_dict(best_state_dict)
            self.model_.to(self.device)

        return self

    # Predict
    def predict_proba(self, X):
        self.model_.eval()
        X = torch.tensor(np.asarray(X), dtype=torch.float32).to(self.device)
        with torch.no_grad():
            logits = self.model_(X)
            probs = torch.sigmoid(logits).cpu().numpy().flatten()
        return np.vstack([1 - probs, probs]).T

    def predict(self, X):
        return (self.predict_proba(X)[:, 1] >= self.threshold).astype(int)

To address the severe class imbalance in our dataset, we explored two complementary strategies:

  1. Class weighting - assigning higher penalties to fraudulent transactions during training, with weight ratios ranging from 1 to 100. This approach aimed to make the neural network more sensitive to missed fraud cases.

  2. SMOTE - The best results were achieved when fraudulent transactions comprised approximately 20% of the dataset

Ultimately, SMOTE alone (without class weighting) provided the most stable and accurate performance, striking a better balance between recall and overall model reliability - as shown later in the analysis

We will now integrate the neural network into the pipeline we've previously created:

In [108]:
# Combine into Pipeline
pipe_nn = Pipeline([
    ("preprocess", preprocess),
    ("smote", smote),
    ("nn", TorchNNWrapper(
        epochs=100,
        batch_size=4096,
        lr=5e-5
    ))
])
Training¶
In [109]:
# Train & Evaluate
X_train = df_train.drop(columns=["is_fraud"])
y_train = df_train["is_fraud"].astype(int)

X_test = df_test.drop(columns=["is_fraud"])
y_test = df_test["is_fraud"].astype(int)

pipe_nn.fit(X_train, y_train)

y_pred_nn = pipe_nn.predict(X_test)
y_proba_nn = pipe_nn.predict_proba(X_test)

print(classification_report(y_test, y_pred_nn))
print("ROC AUC:", roc_auc_score(y_test, y_proba_nn[:, 1]))
Epoch 1/100 | Train Loss: 0.5896 | Current Val Loss: 0.4554 | Best Val Loss: inf
Epoch 2/100 | Train Loss: 0.4084 | Current Val Loss: 0.3261 | Best Val Loss: 0.4554
Epoch 3/100 | Train Loss: 0.3298 | Current Val Loss: 0.2876 | Best Val Loss: 0.3261
Epoch 4/100 | Train Loss: 0.2980 | Current Val Loss: 0.2711 | Best Val Loss: 0.2876
Epoch 5/100 | Train Loss: 0.2807 | Current Val Loss: 0.2550 | Best Val Loss: 0.2711
Epoch 6/100 | Train Loss: 0.2693 | Current Val Loss: 0.2490 | Best Val Loss: 0.2550
Epoch 7/100 | Train Loss: 0.2606 | Current Val Loss: 0.2390 | Best Val Loss: 0.2490
Epoch 8/100 | Train Loss: 0.2527 | Current Val Loss: 0.2334 | Best Val Loss: 0.2390
Epoch 9/100 | Train Loss: 0.2450 | Current Val Loss: 0.2308 | Best Val Loss: 0.2334
Epoch 10/100 | Train Loss: 0.2389 | Current Val Loss: 0.2402 | Best Val Loss: 0.2308
Epoch 11/100 | Train Loss: 0.2329 | Current Val Loss: 0.2207 | Best Val Loss: 0.2308
Epoch 12/100 | Train Loss: 0.2276 | Current Val Loss: 0.2109 | Best Val Loss: 0.2207
Epoch 13/100 | Train Loss: 0.2226 | Current Val Loss: 0.1981 | Best Val Loss: 0.2109
Epoch 14/100 | Train Loss: 0.2183 | Current Val Loss: 0.2048 | Best Val Loss: 0.1981
Epoch 15/100 | Train Loss: 0.2139 | Current Val Loss: 0.1983 | Best Val Loss: 0.1981
Epoch 16/100 | Train Loss: 0.2104 | Current Val Loss: 0.1876 | Best Val Loss: 0.1981
Epoch 17/100 | Train Loss: 0.2068 | Current Val Loss: 0.1902 | Best Val Loss: 0.1876
Epoch 18/100 | Train Loss: 0.2031 | Current Val Loss: 0.1833 | Best Val Loss: 0.1876
Epoch 19/100 | Train Loss: 0.2002 | Current Val Loss: 0.1767 | Best Val Loss: 0.1833
Epoch 20/100 | Train Loss: 0.1960 | Current Val Loss: 0.1960 | Best Val Loss: 0.1767
Epoch 21/100 | Train Loss: 0.1936 | Current Val Loss: 0.1783 | Best Val Loss: 0.1767
Epoch 22/100 | Train Loss: 0.1903 | Current Val Loss: 0.1635 | Best Val Loss: 0.1767
Epoch 23/100 | Train Loss: 0.1874 | Current Val Loss: 0.1642 | Best Val Loss: 0.1635
Epoch 24/100 | Train Loss: 0.1838 | Current Val Loss: 0.1582 | Best Val Loss: 0.1635
Epoch 25/100 | Train Loss: 0.1805 | Current Val Loss: 0.1596 | Best Val Loss: 0.1582
Epoch 26/100 | Train Loss: 0.1779 | Current Val Loss: 0.1519 | Best Val Loss: 0.1582
Epoch 27/100 | Train Loss: 0.1743 | Current Val Loss: 0.1543 | Best Val Loss: 0.1519
Epoch 28/100 | Train Loss: 0.1709 | Current Val Loss: 0.1516 | Best Val Loss: 0.1519
Epoch 29/100 | Train Loss: 0.1679 | Current Val Loss: 0.1445 | Best Val Loss: 0.1516
Epoch 30/100 | Train Loss: 0.1637 | Current Val Loss: 0.1432 | Best Val Loss: 0.1445
Epoch 31/100 | Train Loss: 0.1603 | Current Val Loss: 0.1304 | Best Val Loss: 0.1432
Epoch 32/100 | Train Loss: 0.1563 | Current Val Loss: 0.1608 | Best Val Loss: 0.1304
Epoch 33/100 | Train Loss: 0.1522 | Current Val Loss: 0.1392 | Best Val Loss: 0.1304
Epoch 34/100 | Train Loss: 0.1475 | Current Val Loss: 0.1323 | Best Val Loss: 0.1304
Epoch 35/100 | Train Loss: 0.1414 | Current Val Loss: 0.1302 | Best Val Loss: 0.1304
Epoch 36/100 | Train Loss: 0.1360 | Current Val Loss: 0.1579 | Best Val Loss: 0.1302
Epoch 37/100 | Train Loss: 0.1332 | Current Val Loss: 0.1205 | Best Val Loss: 0.1302
Epoch 38/100 | Train Loss: 0.1273 | Current Val Loss: 0.2724 | Best Val Loss: 0.1205
Epoch 39/100 | Train Loss: 0.1211 | Current Val Loss: 0.1180 | Best Val Loss: 0.1205
Epoch 40/100 | Train Loss: 0.1190 | Current Val Loss: 0.1208 | Best Val Loss: 0.1180
Epoch 41/100 | Train Loss: 0.1174 | Current Val Loss: 0.1260 | Best Val Loss: 0.1180
Epoch 42/100 | Train Loss: 0.1112 | Current Val Loss: 0.0932 | Best Val Loss: 0.1180
Epoch 43/100 | Train Loss: 0.1141 | Current Val Loss: 0.1052 | Best Val Loss: 0.0932
Epoch 44/100 | Train Loss: 0.1092 | Current Val Loss: 0.1994 | Best Val Loss: 0.0932
Epoch 45/100 | Train Loss: 0.1060 | Current Val Loss: 0.1578 | Best Val Loss: 0.0932
Epoch 46/100 | Train Loss: 0.1057 | Current Val Loss: 0.0804 | Best Val Loss: 0.0932
Epoch 47/100 | Train Loss: 0.1039 | Current Val Loss: 0.0831 | Best Val Loss: 0.0804
Epoch 48/100 | Train Loss: 0.1021 | Current Val Loss: 0.1935 | Best Val Loss: 0.0804
Epoch 49/100 | Train Loss: 0.1009 | Current Val Loss: 0.1745 | Best Val Loss: 0.0804
Epoch 50/100 | Train Loss: 0.1005 | Current Val Loss: 0.0863 | Best Val Loss: 0.0804
Epoch 51/100 | Train Loss: 0.0959 | Current Val Loss: 0.0768 | Best Val Loss: 0.0804
Epoch 52/100 | Train Loss: 0.0968 | Current Val Loss: 0.0855 | Best Val Loss: 0.0768
Epoch 53/100 | Train Loss: 0.0938 | Current Val Loss: 0.4512 | Best Val Loss: 0.0768
Epoch 54/100 | Train Loss: 0.0917 | Current Val Loss: 0.1038 | Best Val Loss: 0.0768
Epoch 55/100 | Train Loss: 0.0908 | Current Val Loss: 0.0819 | Best Val Loss: 0.0768
Epoch 56/100 | Train Loss: 0.0898 | Current Val Loss: 0.2161 | Best Val Loss: 0.0768
Epoch 57/100 | Train Loss: 0.0891 | Current Val Loss: 0.0800 | Best Val Loss: 0.0768
Epoch 58/100 | Train Loss: 0.0872 | Current Val Loss: 0.0712 | Best Val Loss: 0.0768
Epoch 59/100 | Train Loss: 0.0866 | Current Val Loss: 0.2831 | Best Val Loss: 0.0712
Epoch 60/100 | Train Loss: 0.0849 | Current Val Loss: 0.0833 | Best Val Loss: 0.0712
Epoch 61/100 | Train Loss: 0.0841 | Current Val Loss: 0.2082 | Best Val Loss: 0.0712
Epoch 62/100 | Train Loss: 0.0838 | Current Val Loss: 0.1042 | Best Val Loss: 0.0712
Epoch 63/100 | Train Loss: 0.0836 | Current Val Loss: 0.0925 | Best Val Loss: 0.0712
Epoch 64/100 | Train Loss: 0.0832 | Current Val Loss: 0.2715 | Best Val Loss: 0.0712
Epoch 65/100 | Train Loss: 0.0796 | Current Val Loss: 0.1622 | Best Val Loss: 0.0712
Epoch 66/100 | Train Loss: 0.0803 | Current Val Loss: 0.0660 | Best Val Loss: 0.0712
Epoch 67/100 | Train Loss: 0.0791 | Current Val Loss: 0.0677 | Best Val Loss: 0.0660
Epoch 68/100 | Train Loss: 0.0817 | Current Val Loss: 0.1238 | Best Val Loss: 0.0660
Epoch 69/100 | Train Loss: 0.0765 | Current Val Loss: 0.1383 | Best Val Loss: 0.0660
Epoch 70/100 | Train Loss: 0.0769 | Current Val Loss: 0.2001 | Best Val Loss: 0.0660
Epoch 71/100 | Train Loss: 0.0775 | Current Val Loss: 0.0656 | Best Val Loss: 0.0660
Epoch 72/100 | Train Loss: 0.0755 | Current Val Loss: 0.1318 | Best Val Loss: 0.0656
Epoch 73/100 | Train Loss: 0.0742 | Current Val Loss: 0.0659 | Best Val Loss: 0.0656
Epoch 74/100 | Train Loss: 0.0765 | Current Val Loss: 0.1926 | Best Val Loss: 0.0656
Epoch 75/100 | Train Loss: 0.0735 | Current Val Loss: 0.3957 | Best Val Loss: 0.0656
Epoch 76/100 | Train Loss: 0.0795 | Current Val Loss: 0.0933 | Best Val Loss: 0.0656
Epoch 77/100 | Train Loss: 0.0713 | Current Val Loss: 0.0607 | Best Val Loss: 0.0656
Epoch 78/100 | Train Loss: 0.0741 | Current Val Loss: 0.0709 | Best Val Loss: 0.0607
Epoch 79/100 | Train Loss: 0.0718 | Current Val Loss: 0.0815 | Best Val Loss: 0.0607
Epoch 80/100 | Train Loss: 0.0725 | Current Val Loss: 0.0732 | Best Val Loss: 0.0607
Epoch 81/100 | Train Loss: 0.0708 | Current Val Loss: 0.0881 | Best Val Loss: 0.0607
Epoch 82/100 | Train Loss: 0.0682 | Current Val Loss: 0.1026 | Best Val Loss: 0.0607
Epoch 83/100 | Train Loss: 0.0680 | Current Val Loss: 0.2170 | Best Val Loss: 0.0607
Epoch 84/100 | Train Loss: 0.0660 | Current Val Loss: 0.1474 | Best Val Loss: 0.0607
Epoch 85/100 | Train Loss: 0.0668 | Current Val Loss: 0.1100 | Best Val Loss: 0.0607
Epoch 86/100 | Train Loss: 0.0687 | Current Val Loss: 0.0699 | Best Val Loss: 0.0607
Epoch 87/100 | Train Loss: 0.0662 | Current Val Loss: 0.0456 | Best Val Loss: 0.0607
Epoch 88/100 | Train Loss: 0.0669 | Current Val Loss: 0.1338 | Best Val Loss: 0.0456
Epoch 89/100 | Train Loss: 0.0707 | Current Val Loss: 0.0896 | Best Val Loss: 0.0456
Epoch 90/100 | Train Loss: 0.0670 | Current Val Loss: 0.0803 | Best Val Loss: 0.0456
Epoch 91/100 | Train Loss: 0.0686 | Current Val Loss: 0.0438 | Best Val Loss: 0.0456
Epoch 92/100 | Train Loss: 0.0638 | Current Val Loss: 0.3370 | Best Val Loss: 0.0438
Epoch 93/100 | Train Loss: 0.0662 | Current Val Loss: 0.0708 | Best Val Loss: 0.0438
Epoch 94/100 | Train Loss: 0.0654 | Current Val Loss: 0.4103 | Best Val Loss: 0.0438
Epoch 95/100 | Train Loss: 0.0627 | Current Val Loss: 0.0547 | Best Val Loss: 0.0438
Epoch 96/100 | Train Loss: 0.0620 | Current Val Loss: 0.2048 | Best Val Loss: 0.0438
Epoch 97/100 | Train Loss: 0.0606 | Current Val Loss: 0.0500 | Best Val Loss: 0.0438
Epoch 98/100 | Train Loss: 0.0629 | Current Val Loss: 0.0765 | Best Val Loss: 0.0438
Epoch 99/100 | Train Loss: 0.0622 | Current Val Loss: 0.0790 | Best Val Loss: 0.0438
Epoch 100/100 | Train Loss: 0.0599 | Current Val Loss: 0.1165 | Best Val Loss: 0.0438
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    553574
           1       0.45      0.63      0.52      2145

    accuracy                           1.00    555719
   macro avg       0.72      0.81      0.76    555719
weighted avg       1.00      1.00      1.00    555719

ROC AUC: 0.9695072674726705

💡 Note: Predictions were made using the model saved at epoch 74, where training stopped due to early stopping (patience = 20)

In [ ]:
# Access the trained neural network model from the pipeline
nn_model = pipe_nn.named_steps['nn']

# Plot training and validation loss
plt.figure(figsize=(10, 6))
plt.plot(nn_model.train_losses_, label='Training Loss')
plt.plot(nn_model.val_losses_, label='Validation Loss')
plt.title('Training and Validation Loss over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.grid(True)
plt.show()

Training Analysis

The chart illustrates the evolution of training and validation loss across 94 epochs, with early stopping activated.

During the initial phase (epochs 0-20), both losses decrease sharply, reflecting rapid learning and effective convergence of the model.

In the middle phase (epochs 20-50), training loss continues to decline steadily, while validation loss fluctuates mildly but remains generally consistent - suggesting that the model generalizes well at this stage.

In the final phase (epochs 50-94), validation loss becomes increasingly unstable, showing sharp oscillations while training loss stays low - a clear sign of emerging overfitting.

The early stopping mechanism successfully halted training at the optimal point (epoch 74) preserving the best balance between model fit and generalization performance on unseen data

Results¶

To evaluate the model's performance, we first define a function, called visualize_model_performance(), which generates a graphical comparison of the results

In [111]:
def visualize_model_performance(precision, recall, f1, roc, model_name="Model"):
    """
    Display Precision, Recall, F1-score, and ROC-AUC for a single model
    and visualize results as a bar chart.
    """

    # Metrics and values
    metrics = ['Precision', 'Recall', 'F1-score', 'ROC-AUC']
    scores = [precision, recall, f1, roc]

    # Print table
    results_df = pd.DataFrame({
        'Metric': metrics,
        model_name: np.round(scores, 3)
    })
    print(results_df.to_string(index=False))
    print()

    # Bar chart
    x = np.arange(len(metrics))
    fig, ax = plt.subplots(figsize=(7, 5))
    bars = ax.bar(x, scores, color='#4C72B0', width=0.6)

    # Annotate bars
    for bar in bars:
        ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.015,
                f'{bar.get_height():.2f}', ha='center', va='bottom', fontsize=9)

    # Aesthetics
    ax.set_ylabel('Score')
    ax.set_ylim(0, 1.05)
    ax.set_xticks(x)
    ax.set_xticklabels(metrics)
    ax.set_title(f'{model_name}: Performance Metrics')
    ax.grid(axis='y', linestyle='--', alpha=0.6)

    plt.tight_layout()
    plt.show()
In [112]:
visualize_model_performance(recall=0.64, precision=0.28, f1=0.39, roc=0.95543, model_name="Neural Network")
   Metric  Neural Network
Precision            0.28
   Recall            0.64
 F1-score            0.39
  ROC-AUC            0.95

No description has been provided for this image

Model Performance

The Neural Network model delivers a notable leap in performance across all key metrics, confirming its ability to capture complex, non-linear patterns in the data.

With a recall of 0.64, the model successfully detects nearly two-thirds of all fraudulent transactions, which is a substantial improvement over the previous models. Although precision (0.28) remains moderate, this trade-off is often acceptable in fraud detection, where minimizing missed frauds (high recall) is far more critical than avoiding every false alarm.

The F1-score of 0.39 demonstrates a balanced compromise between precision and recall, highlighting the model's stronger overall detection capability.

Moreover, the ROC-AUC of 0.96 indicates excellent class separability, showing that the network effectively distinguishes between fraudulent and legitimate transactions based on its predicted probabilities.

Overall, this Neural Network is the first model to surpass the Logistic Regression baseline, achieving higher recall and superior discriminative power. It stands out as the most effective approach so far for this dataset, combining robust learning capacity with meaningful real-world applicability in fraud detection.

Effect of Class Weights on Neural Network Performance:

In [113]:
# Create DataFrame
data = {
    "ROC": [0.863, 0.859, 0.864, 0.856, 0.865, 0.859, 0.868, 0.865, 0.859, 0.821, 0.865, 0.853, 0.843, 0.808, 0.854],
    "recall": [0.075, 0.076, 0.097, 0.092, 0.307, 0.276, 0.382, 0.487, 0.374, 0.361, 0.367, 0.447, 0.489, 0.312, 0.647],
    "precision": [0.976, 0.994, 0.835, 0.399, 0.135, 0.207, 0.095, 0.045, 0.081, 0.044, 0.059, 0.046, 0.038, 0.042, 0.022],
    "ratio": [1, 2, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 100]
}
df = pd.DataFrame(data)

# Plot
plt.figure(figsize=(10, 6))
plt.plot(df["ratio"], df["ROC"], marker='o', label="ROC-AUC", linewidth=2)
plt.plot(df["ratio"], df["recall"], marker='o', label="Recall", linewidth=2)
plt.plot(df["ratio"], df["precision"], marker='o', label="Precision", linewidth=2)

# Formatting
plt.xscale("log")  # log-scale for better visibility of large ratios
plt.xlabel("Minority-to-Majority Class Weight Ratio (log scale)", fontsize=12)
plt.ylabel("Score (0–1)", fontsize=12)
plt.title("Effect of Class Weights on Neural Network Performance", fontsize=14, weight='bold')
plt.grid(True, linestyle='--', alpha=0.6)
plt.legend()
plt.tight_layout()
plt.show()
No description has been provided for this image

Effect of Class Weights on Neural Network Performance

💡 Note: the following analysis was conducted using class weights only, without applying SMOTE, to isolate the effect of weighting on model performance

The graph illustrates how varying the minority-to-majority class weight ratio affects the neural network's precision and recall.

As the weight for the minority (fraud) class increases, recall improves, and the model becomes more sensitive to fraud - but precision decreases, resulting in more false positives.

Meanwhile, the ROC-AUC remains largely stable, indicating that the model's overall ability to separate fraud from legitimate transactions is not significantly impacted.

In practice, the optimal class weight depends on the objective:

  • To minimize missed frauds, increase the class weight (favor recall)
  • To reduce false positives, lower the class weight (favor precision)

This trade-off provides a flexible way to fine-tune the model's behavior according to operational priorities.

🤖 TabNet Classifier¶

To push our fraud detection analysis further, we introduce a deep neural architecture specifically designed for tabular data.

Unlike traditional models that rely on manual feature engineering, TabNet uses sequential attention to dynamically select the most informative features at each decision step. This allows it to learn complex, non-linear relationships while maintaining a degree of interpretability - something rare among deep learning models.

We expect TabNet to outperform previous models by capturing subtle fraud patterns that logistic regression and tree-based methods may overlook. Its built-in handling of feature sparsity and interpretability makes it a promising candidate for highly imbalanced fraud detection tasks

Training¶
In [114]:
!pip install pytorch-tabnet torch scikit-learn pandas numpy --quiet
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0.0/44.5 kB ? eta -:--:--
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44.5/44.5 kB 3.4 MB/s eta 0:00:00
In [115]:
from pytorch_tabnet.tab_model import TabNetClassifier

# Ensuring GPU accessability
device_name = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device_name.upper()}")

set_global_seed(42)
Using device: CUDA

Without SMOTE:

In [117]:
# Data Preparation

X_train = df_train.drop(columns=["is_fraud"])
y_train = df_train["is_fraud"].astype(int)

X_test = df_test.drop(columns=["is_fraud"])
y_test = df_test["is_fraud"].astype(int)


# Preprocess
X_train_proc = preprocess.fit_transform(X_train, y_train)
X_test_proc = preprocess.transform(X_test)
In [118]:
# Training

tabnet = TabNetClassifier(
    n_d=32, n_a=32, n_steps=5,
    gamma=1.5, lambda_sparse=1e-4,
    optimizer_fn=torch.optim.Adam,
    optimizer_params=dict(lr=1e-3),
    mask_type='entmax',
    device_name=device_name
)

tabnet.fit(
    X_train_proc, y_train.values,
    max_epochs=10, # keep small — loss converges fast
    patience=5,    # stop after 5 epochs of no improvement
    batch_size=2048,
    virtual_batch_size=256,
    num_workers=0,
    drop_last=False
)
/usr/local/lib/python3.12/dist-packages/pytorch_tabnet/abstract_model.py:82: UserWarning: Device used : cuda
  warnings.warn(f"Device used : {self.device}")
/usr/local/lib/python3.12/dist-packages/pytorch_tabnet/abstract_model.py:687: UserWarning: No early stopping will be performed, last training weights will be used.
  warnings.warn(wrn_msg)
epoch 0  | loss: 0.05758 |  0:00:42s
epoch 1  | loss: 0.0203  |  0:01:22s
epoch 2  | loss: 0.0175  |  0:02:01s
epoch 3  | loss: 0.01536 |  0:02:41s
epoch 4  | loss: 0.01409 |  0:03:20s
epoch 5  | loss: 0.01367 |  0:04:00s
epoch 6  | loss: 0.01259 |  0:04:39s
epoch 7  | loss: 0.01154 |  0:05:19s
epoch 8  | loss: 0.0108  |  0:05:59s
epoch 9  | loss: 0.01017 |  0:06:39s
In [119]:
y_pred_tabnet = tabnet.predict(X_test_proc)
y_proba_tabnet = tabnet.predict_proba(X_test_proc)[:, 1]


print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_tabnet))
print("\nClassification Report:\n", classification_report(y_test, y_pred_tabnet))
print("\nROC-AUC Score:", roc_auc_score(y_test, y_proba_tabnet))
Confusion Matrix:
 [[553439    135]
 [  1472    673]]

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00    553574
           1       0.83      0.31      0.46      2145

    accuracy                           1.00    555719
   macro avg       0.92      0.66      0.73    555719
weighted avg       1.00      1.00      1.00    555719


ROC-AUC Score: 0.9499632761462254
In [120]:
# Visualize Results
plot_model_performance(y_test, y_pred_tabnet, y_proba_tabnet, model_name="TabNet (Without SMOTE)")
No description has been provided for this image

With SMOTE:

To ensure a fair comparison with previous models, we extend TabNet with SMOTE oversampling within a custom scikit-learn compatible pipeline.

We tested two oversampling values: 0.1 and 0.2, meaning that we trained 2 different models. In the first model the minority class was oversampled to be 0.1% of the entire data, and 0.2% in the second model.

In [121]:
# DataFrame Wrapper (preserve feature names)
class DataFrameWrapper(TransformerMixin, BaseEstimator):
    """
    Wrap any transformer so its output is returned as a pandas DataFrame
    """
    def __init__(self, transformer):
        self.transformer = transformer

    def fit(self, X, y=None):
        self.transformer.fit(X, y)
        return self

    def transform(self, X):
        Xt = self.transformer.transform(X)
        # try to preserve feature names
        try:
            cols = self.transformer.get_feature_names_out()
        except Exception:
            cols = [f"col_{i}" for i in range(Xt.shape[1])]
        return pd.DataFrame(Xt, columns=cols, index=X.index)
In [122]:
class TabNetWrapper(BaseEstimator, ClassifierMixin):
    """
    SKlearn-style wrapper around PyTorch TabNet
    """
    def __init__(self, **kwargs):
        self.model_params = kwargs
        self.model_ = None

    def fit(self, X, y):
        X_np = np.asarray(X, dtype=np.float32)
        y_np = np.asarray(y, dtype=np.int64)

        self.model_ = TabNetClassifier(**self.model_params)
        self.model_.fit(
            X_np, y_np,
            max_epochs=10,
            patience=5,
            batch_size=2048,
            virtual_batch_size=256,
            num_workers=0,
            drop_last=False,
        )
        return self

    def predict(self, X):
        X_np = np.asarray(X, dtype=np.float32)
        return self.model_.predict(X_np)

    def predict_proba(self, X):
        X_np = np.asarray(X, dtype=np.float32)
        return self.model_.predict_proba(X_np)

First model: Tabnet with SMOTE 0.1%

In [128]:
# Data Preparation (same as before)
# version with SMOTE 0.1
X_train = df_train.drop(columns=["is_fraud"])
y_train = df_train["is_fraud"].astype(int)

X_test = df_test.drop(columns=["is_fraud"])
y_test = df_test["is_fraud"].astype(int)

# Defining TabNet parameters
tabnet = TabNetWrapper(
    n_d=32,
    n_a=32,
    n_steps=5,
    gamma=1.5,
    lambda_sparse=1e-4,
    optimizer_fn=torch.optim.Adam,
    optimizer_params=dict(lr=1e-3),
    mask_type='entmax',
    device_name='cuda' if torch.cuda.is_available() else 'cpu'
)


# Build full sklearn pipeline
smote = SMOTE(
    sampling_strategy=0.1,  # minority class will be 10% of majority
    random_state=42,
    k_neighbors=5
)

steps = [
    ("preprocess", DataFrameWrapper(preprocess)),  # existing ColumnTransformer
    ("smote", smote),
    ("tabnet", tabnet)
]

pipe = Pipeline(steps)

# Train and evaluate
pipe.fit(X_train, y_train)
/usr/local/lib/python3.12/dist-packages/pytorch_tabnet/abstract_model.py:82: UserWarning: Device used : cuda
  warnings.warn(f"Device used : {self.device}")
/usr/local/lib/python3.12/dist-packages/pytorch_tabnet/abstract_model.py:687: UserWarning: No early stopping will be performed, last training weights will be used.
  warnings.warn(wrn_msg)
epoch 0  | loss: 0.14633 |  0:00:44s
epoch 1  | loss: 0.06844 |  0:01:28s
epoch 2  | loss: 0.05102 |  0:02:12s
epoch 3  | loss: 0.04002 |  0:02:57s
epoch 4  | loss: 0.03154 |  0:03:41s
epoch 5  | loss: 0.0278  |  0:04:25s
epoch 6  | loss: 0.02381 |  0:05:10s
epoch 7  | loss: 0.02155 |  0:05:54s
epoch 8  | loss: 0.01969 |  0:06:39s
epoch 9  | loss: 0.01801 |  0:07:23s
Out[128]:
Pipeline(steps=[('preprocess',
                 DataFrameWrapper(transformer=ColumnTransformer(transformers=[('merchant_rate',
                                                                               FraudRateEncoder(min_samples=100,
                                                                                                smoothing=100),
                                                                               ['merchant']),
                                                                              ('job_rate',
                                                                               FraudRateEncoder(min_samples=100,
                                                                                                smoothing=100),
                                                                               ['job']),
                                                                              ('city_rate',
                                                                               FraudRateEncoder(min_samples=100,
                                                                                                smoothing=100),
                                                                               ['city']),
                                                                              ('state_rate',
                                                                               FraudRateEncoder(min_sampl...
                                                                                'gender']),
                                                                              ('cyclical_time',
                                                                               CyclicalTimeEncoder(period_map={'day_of_week': 7,
                                                                                                               'hour': 24,
                                                                                                               'month': 12}),
                                                                               ['hour',
                                                                                'day_of_week',
                                                                                'month']),
                                                                              ('scaler',
                                                                               MinMaxScaler(),
                                                                               ['amt',
                                                                                'city_pop',
                                                                                'distance_cardholder_merchant',
                                                                                'age',
                                                                                'card_prev_fraud_ratio'])],
                                                                verbose_feature_names_out=False))),
                ('smote', SMOTE(random_state=42, sampling_strategy=0.1)),
                ('tabnet', TabNetWrapper())])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Pipeline(steps=[('preprocess',
                 DataFrameWrapper(transformer=ColumnTransformer(transformers=[('merchant_rate',
                                                                               FraudRateEncoder(min_samples=100,
                                                                                                smoothing=100),
                                                                               ['merchant']),
                                                                              ('job_rate',
                                                                               FraudRateEncoder(min_samples=100,
                                                                                                smoothing=100),
                                                                               ['job']),
                                                                              ('city_rate',
                                                                               FraudRateEncoder(min_samples=100,
                                                                                                smoothing=100),
                                                                               ['city']),
                                                                              ('state_rate',
                                                                               FraudRateEncoder(min_sampl...
                                                                                'gender']),
                                                                              ('cyclical_time',
                                                                               CyclicalTimeEncoder(period_map={'day_of_week': 7,
                                                                                                               'hour': 24,
                                                                                                               'month': 12}),
                                                                               ['hour',
                                                                                'day_of_week',
                                                                                'month']),
                                                                              ('scaler',
                                                                               MinMaxScaler(),
                                                                               ['amt',
                                                                                'city_pop',
                                                                                'distance_cardholder_merchant',
                                                                                'age',
                                                                                'card_prev_fraud_ratio'])],
                                                                verbose_feature_names_out=False))),
                ('smote', SMOTE(random_state=42, sampling_strategy=0.1)),
                ('tabnet', TabNetWrapper())])
DataFrameWrapper(transformer=ColumnTransformer(transformers=[('merchant_rate',
                                                              FraudRateEncoder(min_samples=100,
                                                                               smoothing=100),
                                                              ['merchant']),
                                                             ('job_rate',
                                                              FraudRateEncoder(min_samples=100,
                                                                               smoothing=100),
                                                              ['job']),
                                                             ('city_rate',
                                                              FraudRateEncoder(min_samples=100,
                                                                               smoothing=100),
                                                              ['city']),
                                                             ('state_rate',
                                                              FraudRateEncoder(min_samples=100,
                                                                               smoothing=100),
                                                              ['state'...
                                                             ('onehot_low',
                                                              OneHotEncoder(handle_unknown='ignore',
                                                                            sparse_output=False),
                                                              ['category',
                                                               'gender']),
                                                             ('cyclical_time',
                                                              CyclicalTimeEncoder(period_map={'day_of_week': 7,
                                                                                              'hour': 24,
                                                                                              'month': 12}),
                                                              ['hour',
                                                               'day_of_week',
                                                               'month']),
                                                             ('scaler',
                                                              MinMaxScaler(),
                                                              ['amt',
                                                               'city_pop',
                                                               'distance_cardholder_merchant',
                                                               'age',
                                                               'card_prev_fraud_ratio'])],
                                               verbose_feature_names_out=False))
ColumnTransformer(transformers=[('merchant_rate',
                                 FraudRateEncoder(min_samples=100,
                                                  smoothing=100),
                                 ['merchant']),
                                ('job_rate',
                                 FraudRateEncoder(min_samples=100,
                                                  smoothing=100),
                                 ['job']),
                                ('city_rate',
                                 FraudRateEncoder(min_samples=100,
                                                  smoothing=100),
                                 ['city']),
                                ('state_rate',
                                 FraudRateEncoder(min_samples=100,
                                                  smoothing=100),
                                 ['state']),
                                ('card_freq', CardFrequency...
                                ('onehot_low',
                                 OneHotEncoder(handle_unknown='ignore',
                                               sparse_output=False),
                                 ['category', 'gender']),
                                ('cyclical_time',
                                 CyclicalTimeEncoder(period_map={'day_of_week': 7,
                                                                 'hour': 24,
                                                                 'month': 12}),
                                 ['hour', 'day_of_week', 'month']),
                                ('scaler', MinMaxScaler(),
                                 ['amt', 'city_pop',
                                  'distance_cardholder_merchant', 'age',
                                  'card_prev_fraud_ratio'])],
                  verbose_feature_names_out=False)
['merchant']
FraudRateEncoder(min_samples=100, smoothing=100)
['job']
FraudRateEncoder(min_samples=100, smoothing=100)
['city']
FraudRateEncoder(min_samples=100, smoothing=100)
['state']
FraudRateEncoder(min_samples=100, smoothing=100)
['cc_num']
CardFrequencyEncoder()
['category', 'gender']
OneHotEncoder(handle_unknown='ignore', sparse_output=False)
['hour', 'day_of_week', 'month']
CyclicalTimeEncoder(period_map={'day_of_week': 7, 'hour': 24, 'month': 12})
['amt', 'city_pop', 'distance_cardholder_merchant', 'age', 'card_prev_fraud_ratio']
MinMaxScaler()
SMOTE(random_state=42, sampling_strategy=0.1)
TabNetWrapper()
In [129]:
# Evaluate
y_pred_tabnet_smote01 = pipe.predict(X_test)
y_proba_tabnet_smote01 = pipe.predict_proba(X_test)[:, 1]
In [131]:
plot_model_performance(y_test, y_pred_tabnet_smote01, y_proba_tabnet_smote01, model_name="TabNet SMOTE - 0.1")
No description has been provided for this image

First model: Tabnet with SMOTE 0.2%

In [123]:
# Data Preparation (same as before)

X_train = df_train.drop(columns=["is_fraud"])
y_train = df_train["is_fraud"].astype(int)

X_test = df_test.drop(columns=["is_fraud"])
y_test = df_test["is_fraud"].astype(int)

# Defining TabNet parameters
tabnet = TabNetWrapper(
    n_d=32,
    n_a=32,
    n_steps=5,
    gamma=1.5,
    lambda_sparse=1e-4,
    optimizer_fn=torch.optim.Adam,
    optimizer_params=dict(lr=1e-3),
    mask_type='entmax',
    device_name='cuda' if torch.cuda.is_available() else 'cpu'
)


# Build full sklearn pipeline
smote = SMOTE(
    sampling_strategy=0.2,  # minority class will be 20% of majority
    random_state=42,
    k_neighbors=5
)

steps = [
    ("preprocess", DataFrameWrapper(preprocess)),  # existing ColumnTransformer
    ("smote", smote),
    ("tabnet", tabnet)
]

pipe = Pipeline(steps)

# Train and evaluate
pipe.fit(X_train, y_train)
/usr/local/lib/python3.12/dist-packages/pytorch_tabnet/abstract_model.py:82: UserWarning: Device used : cuda
  warnings.warn(f"Device used : {self.device}")
/usr/local/lib/python3.12/dist-packages/pytorch_tabnet/abstract_model.py:687: UserWarning: No early stopping will be performed, last training weights will be used.
  warnings.warn(wrn_msg)
epoch 0  | loss: 0.18298 |  0:00:49s
epoch 1  | loss: 0.0804  |  0:01:37s
epoch 2  | loss: 0.05342 |  0:02:26s
epoch 3  | loss: 0.04007 |  0:03:15s
epoch 4  | loss: 0.03217 |  0:04:03s
epoch 5  | loss: 0.02855 |  0:04:52s
epoch 6  | loss: 0.02505 |  0:05:40s
epoch 7  | loss: 0.02303 |  0:06:28s
epoch 8  | loss: 0.0242  |  0:07:16s
epoch 9  | loss: 0.02155 |  0:08:04s
Out[123]:
Pipeline(steps=[('preprocess',
                 DataFrameWrapper(transformer=ColumnTransformer(transformers=[('merchant_rate',
                                                                               FraudRateEncoder(min_samples=100,
                                                                                                smoothing=100),
                                                                               ['merchant']),
                                                                              ('job_rate',
                                                                               FraudRateEncoder(min_samples=100,
                                                                                                smoothing=100),
                                                                               ['job']),
                                                                              ('city_rate',
                                                                               FraudRateEncoder(min_samples=100,
                                                                                                smoothing=100),
                                                                               ['city']),
                                                                              ('state_rate',
                                                                               FraudRateEncoder(min_sampl...
                                                                                'gender']),
                                                                              ('cyclical_time',
                                                                               CyclicalTimeEncoder(period_map={'day_of_week': 7,
                                                                                                               'hour': 24,
                                                                                                               'month': 12}),
                                                                               ['hour',
                                                                                'day_of_week',
                                                                                'month']),
                                                                              ('scaler',
                                                                               MinMaxScaler(),
                                                                               ['amt',
                                                                                'city_pop',
                                                                                'distance_cardholder_merchant',
                                                                                'age',
                                                                                'card_prev_fraud_ratio'])],
                                                                verbose_feature_names_out=False))),
                ('smote', SMOTE(random_state=42, sampling_strategy=0.2)),
                ('tabnet', TabNetWrapper())])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Pipeline(steps=[('preprocess',
                 DataFrameWrapper(transformer=ColumnTransformer(transformers=[('merchant_rate',
                                                                               FraudRateEncoder(min_samples=100,
                                                                                                smoothing=100),
                                                                               ['merchant']),
                                                                              ('job_rate',
                                                                               FraudRateEncoder(min_samples=100,
                                                                                                smoothing=100),
                                                                               ['job']),
                                                                              ('city_rate',
                                                                               FraudRateEncoder(min_samples=100,
                                                                                                smoothing=100),
                                                                               ['city']),
                                                                              ('state_rate',
                                                                               FraudRateEncoder(min_sampl...
                                                                                'gender']),
                                                                              ('cyclical_time',
                                                                               CyclicalTimeEncoder(period_map={'day_of_week': 7,
                                                                                                               'hour': 24,
                                                                                                               'month': 12}),
                                                                               ['hour',
                                                                                'day_of_week',
                                                                                'month']),
                                                                              ('scaler',
                                                                               MinMaxScaler(),
                                                                               ['amt',
                                                                                'city_pop',
                                                                                'distance_cardholder_merchant',
                                                                                'age',
                                                                                'card_prev_fraud_ratio'])],
                                                                verbose_feature_names_out=False))),
                ('smote', SMOTE(random_state=42, sampling_strategy=0.2)),
                ('tabnet', TabNetWrapper())])
DataFrameWrapper(transformer=ColumnTransformer(transformers=[('merchant_rate',
                                                              FraudRateEncoder(min_samples=100,
                                                                               smoothing=100),
                                                              ['merchant']),
                                                             ('job_rate',
                                                              FraudRateEncoder(min_samples=100,
                                                                               smoothing=100),
                                                              ['job']),
                                                             ('city_rate',
                                                              FraudRateEncoder(min_samples=100,
                                                                               smoothing=100),
                                                              ['city']),
                                                             ('state_rate',
                                                              FraudRateEncoder(min_samples=100,
                                                                               smoothing=100),
                                                              ['state'...
                                                             ('onehot_low',
                                                              OneHotEncoder(handle_unknown='ignore',
                                                                            sparse_output=False),
                                                              ['category',
                                                               'gender']),
                                                             ('cyclical_time',
                                                              CyclicalTimeEncoder(period_map={'day_of_week': 7,
                                                                                              'hour': 24,
                                                                                              'month': 12}),
                                                              ['hour',
                                                               'day_of_week',
                                                               'month']),
                                                             ('scaler',
                                                              MinMaxScaler(),
                                                              ['amt',
                                                               'city_pop',
                                                               'distance_cardholder_merchant',
                                                               'age',
                                                               'card_prev_fraud_ratio'])],
                                               verbose_feature_names_out=False))
ColumnTransformer(transformers=[('merchant_rate',
                                 FraudRateEncoder(min_samples=100,
                                                  smoothing=100),
                                 ['merchant']),
                                ('job_rate',
                                 FraudRateEncoder(min_samples=100,
                                                  smoothing=100),
                                 ['job']),
                                ('city_rate',
                                 FraudRateEncoder(min_samples=100,
                                                  smoothing=100),
                                 ['city']),
                                ('state_rate',
                                 FraudRateEncoder(min_samples=100,
                                                  smoothing=100),
                                 ['state']),
                                ('card_freq', CardFrequency...
                                ('onehot_low',
                                 OneHotEncoder(handle_unknown='ignore',
                                               sparse_output=False),
                                 ['category', 'gender']),
                                ('cyclical_time',
                                 CyclicalTimeEncoder(period_map={'day_of_week': 7,
                                                                 'hour': 24,
                                                                 'month': 12}),
                                 ['hour', 'day_of_week', 'month']),
                                ('scaler', MinMaxScaler(),
                                 ['amt', 'city_pop',
                                  'distance_cardholder_merchant', 'age',
                                  'card_prev_fraud_ratio'])],
                  verbose_feature_names_out=False)
['merchant']
FraudRateEncoder(min_samples=100, smoothing=100)
['job']
FraudRateEncoder(min_samples=100, smoothing=100)
['city']
FraudRateEncoder(min_samples=100, smoothing=100)
['state']
FraudRateEncoder(min_samples=100, smoothing=100)
['cc_num']
CardFrequencyEncoder()
['category', 'gender']
OneHotEncoder(handle_unknown='ignore', sparse_output=False)
['hour', 'day_of_week', 'month']
CyclicalTimeEncoder(period_map={'day_of_week': 7, 'hour': 24, 'month': 12})
['amt', 'city_pop', 'distance_cardholder_merchant', 'age', 'card_prev_fraud_ratio']
MinMaxScaler()
SMOTE(random_state=42, sampling_strategy=0.2)
TabNetWrapper()
In [124]:
# Evaluate
y_pred_tabnet_smote = pipe.predict(X_test)
y_proba_tabnet_smote = pipe.predict_proba(X_test)[:, 1]

print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_tabnet_smote))
print("\nClassification Report:\n", classification_report(y_test, y_pred_tabnet_smote))
print("\nROC-AUC Score:", roc_auc_score(y_test, y_proba_tabnet_smote))
Confusion Matrix:
 [[540084  13490]
 [   619   1526]]

Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.98      0.99    553574
           1       0.10      0.71      0.18      2145

    accuracy                           0.97    555719
   macro avg       0.55      0.84      0.58    555719
weighted avg       1.00      0.97      0.98    555719


ROC-AUC Score: 0.9623903868991246
In [126]:
# Visualize Results
plot_model_performance(y_test, y_pred_tabnet_smote, y_proba_tabnet_smote, model_name="TabNet SMOTE - 0.2")
No description has been provided for this image
Results¶
In [134]:
y_pred_tabnet_smote02 = y_pred_tabnet_smote
y_proba_tabnet_smote02 = y_proba_tabnet_smote
In [135]:
compare_three_models(
    y_true=y_test,
    preds_1=y_pred_tabnet,
    probs_1=y_proba_tabnet,
    preds_2=y_pred_tabnet_smote01,
    probs_2=y_proba_tabnet_smote01,
    preds_3=y_pred_tabnet_smote02,
    probs_3=y_proba_tabnet_smote02,
    model_names=('TabNet', 'TabNet (SMOTE 0.1)', 'TabNet (SMOTE 0.2)')
)
   Metric  TabNet  TabNet (SMOTE 0.1)  TabNet (SMOTE 0.2)
Precision    0.83                0.55                0.10
   Recall    0.31                0.64                0.71
 F1-score    0.46                0.59                0.18
  ROC-AUC    0.95                0.98                0.96

No description has been provided for this image

Model Performance

Introducing two levels of SMOTE oversampling (0.1 and 0.2) reveals how the balance between fraud sensitivity and prediction accuracy shifts as synthetic samples increase.

  • The baseline TabNet remains highly precise (0.83) but conservative, identifying only 31% of frauds (recall = 0.31). This depicts a cautious classifier that avoids false positives but misses many frauds.

  • With SMOTE 0.1, the model becomes more balanced - recall increases substantially to 0.64, while precision decreases moderately to 0.55. The resulting F1-score of 0.59 marks the most effective compromise between precision and recall, supported by a near-perfect ROC-AUC of 0.98.

  • At SMOTE 0.2, the model becomes highly sensitive (recall peaks at 0.71) but precision collapses to 0.10, indicating many false alarms. The F1-score correspondingly drops to 0.18.

In summary:

  • If your goal is maximum precision and fewer false positives → choose TabNet (no SMOTE).

  • If you seek the best overall balance between detecting and correctly classifying fraud → choose TabNet (SMOTE 0.1).

  • If you prioritize catching as many frauds as possible, even at the cost of high false positives → choose TabNet (SMOTE 0.2).

Models Conclusion¶

Model Performance Summary

Model Precision Recall F1-score ROC-AUC
Logistic Regression 0.07 0.00 0.00 0.85
Logistic Regression + SMOTE 0.19 0.62 0.29 0.92
Random Forest 0.45 0.07 0.12 0.84
Random Forest + SMOTE 0.35 0.07 0.12 0.85

Random Forest (SMOTE + Tuned) | 0.01 | 0.09 | 0.02 | 0.68 | | XGBoost | 0.07 | 0.21 | 0.10 | 0.86 | | XGBoost + SMOTE | 0.07 | 0.27 | 0.11 | 0.87 | | XGBoost (SMOTE + Tuned) | 0.05 | 0.33 | 0.09 | 0.87 | | Neural Network | 0.28 | 0.64 | 0.39 | 0.96 | | TabNet | 0.83 | 0.31 | 0.46 | 0.95 | | TabNet + SMOTE-0.1 | 0.54 | 0.64 | 0.59 | 0.98 | | TabNet + SMOTE-0.2 | 0.1 | 0.71 | 0.18 | 0.96 |

The best result across a given colum is bolded, and the second best result is underlined.

Conclusion

Across all models evaluated, results highlight a clear evolution from simple linear approaches to advanced deep learning architectures, both in predictive strength and practical usability for fraud detection.

While Logistic Regression initially struggled with extreme class imbalance, applying SMOTE transformed it into a strong and reliable baseline, achieving a solid ROC-AUC of 0.92. This demonstrates that even straightforward models can be highly effective when supported by proper data balancing techniques.

Ensemble methods like Random Forest and XGBoost failed to improve the overall model performance, indicating difficulty in capturing the rare and complex fraud patterns within the data.

The Neural Network achieved a good balance, with strong recall (0.64) and an impressive ROC-AUC of 0.96, showing its ability to model non-linear relationships. However, it was the TabNet family of models that clearly stood out - demonstrating top-tier performance. Particularly, TabNet with SMOTE 0.1 delivered the best overall results, achieving the highest F1-score (0.59) and ROC-AUC (0.98), representing a near-optimal balance between detecting fraud and minimizing false alarms.

From a business standpoint, these findings suggest that advanced tabular deep learning models like TabNet can significantly enhance fraud detection pipelines. When paired with careful oversampling, they maximize the detection of fraudulent transactions without overwhelming analysts with false positives - leading to higher operational efficiency, improved risk management, and reduced financial loss.

Final Project Conclusion¶

This project set out to build a reliable fraud detection system using a range of machine learning and deep learning models - tackling one of the core aspects of data science: extreme class imbalance.

From the EDA phase, the dataset proved to be clean and informative, with no missing values, duplicates, or major outliers. The features appeared meaningful and potentially predictive, providing a solid foundation for modeling.

The unsupervised learning experiments (PCA, t-SNE, and K-Means) were an ambitious attempt to uncover hidden patterns, but they did not produce any actionable insights that could significantly improve model performance - an expected outcome given the highly imbalanced nature of the data.

In contrast, the feature engineering process delivered strong results. Careful experimentation - including creative additions such as the is_night feature - demonstrated clear performance gains across multiple models. Similarly, the preprocessing pipeline proved effective: techniques like fraud rate encoding for categorical features meaningfully improved discrimination between fraudulent and legitimate transactions.

Regarding modeling, several models performed well. Logistic Regression provided a surprisingly strong baseline once SMOTE was applied, showing that even simple models can yield competitive results with proper class balancing. The TabNet models emerged as the clear winners, achieving the best trade-off between precision and recall, especially when combined with SMOTE (0.1). Their strong generalization make them ideal candidates for real-world fraud detection applications.

Overall, this project demonstrates that success in fraud detection is not just about model complexity but about thoughtful preprocessing, feature design, and class balancing - with deep tabular architectures like TabNet showing exceptional promise for future deployment.

Future Work & Recommendations¶

Our exploration into credit card fraud detection underscores the potential of machine learning for tackling complex and highly imbalanced datasets. While our models achieved strong performance, particularly TabNet with SMOTE, the evolving nature of fraud demands continuous refinement.

  1. Model Optimization and Exploration

    • Apply more advanced hyperparameter tuning (for instance, Bayesian Optimization) for deeper performance improvements.

    • Explore additional architectures like LightGBM, CatBoost, or even the relatively new T-JEPA to model transaction relationships.

  2. Deep Learning for Temporal Patterns

    • Leverage RNNs or LSTMs to capture sequential dependencies in transaction data and detect behavioral shifts over time

    • Due to computational limits, this remains future work once greater GPU resources become available

  3. Feature Engineering Enhancements

    Create interaction features. For instance:

    • category x hour: Analyzes the activity of the different categories across the different hours of day. This could also add an additional layer of behavioral analysis.
    • merchant x amount: The average cash transactions of different merchants. This might help to identify potential suspitious merchants.